Is it possible to do a compound sort in solr without Field Collapsing?
If I have two car models, Ford and Chevy, can I sort first on Ford where price is less than 2,000, then Ford > 2,000, then the Chevy models? I would like to do this without grouping, and without applying a price sort to the Chevy models.
For example, something like &sort=Model:"Ford" AND price:[0 TO 2000]
so that I get:
Ford 1, $1000
Ford 2, $500
Ford 2, $1500
_________
Ford 3, $3000
Ford 3, $5000
_______
Chevy 1
Chevy 2
Chevy 3
I've tinkered a bit with this and come up with a solution based on the query() function, since you can use that together with sorting. I'm not sure about the performance; depending on the number of documents in your index it might not matter, so the only way to know is to try it and see how it performs. I've used name and price as my two fields in the schema, which I think map to your Model and price fields.
The way sort works is that each clause is evaluated in order: the first sort expression is applied first, then the next one breaks any ties, and so on.
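For example, a plain two-clause sort over the same name and price fields used below:

sort=price asc,name desc

Documents are ordered by price first; only when two documents share the same price does name decide their relative order.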
I've removed url escaping and formatted everything a bit:
sort=query($sq1,0) asc,query($sq2,0) asc
&sq1=name:Ford* AND price:[0 TO 1500]
&sq2=name:Ford*
This means that the first sort is performed on the query named in the sq1= URL parameter, but if there's a tie (which there will be whenever a document doesn't match), the query named under sq2= is used ($sq1 and $sq2 refer to these two queries, and Solr makes a simple substitution before evaluating the query() function).
I haven't provided a default sort order, but you could add name asc as a default sort. The 0 as the second argument to query() is the value the sort will use if there isn't a match from the query (otherwise it'll use the score from the query). You could feed this value into product() and multiply it with the price, to sort each of the "buckets" by price as well if needed; a sketch of that follows.
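As a rough, untested sketch of that last idea: wildcard and range queries normally contribute a constant score of 1 per match, so multiplying the query() value by price yields the price for matching documents and 0 for everything else. Something along these lines would then order each bucket by price, highest first (note that with asc the non-matching zeros would sort ahead of the matches, so the directions need some experimenting):

sort=product(query($sq1,0),price) desc,product(query($sq2,0),price) desc,name asc
&sq1=name:Ford* AND price:[0 TO 1500]
&sq2=name:Ford*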
If I have the following data:
Results Table (column [Required]):
I want one grape
I want one orange
I want one apple
I want one carrot
I want one watermelon
Fruit Table (column [Name]):
grape
orange
apple
What I want to do is essentially say: give me all results where users are looking for a fruit. This is all just an example; I am looking at a table with roughly 1 million records and a string field of 4000+ characters. I am expecting a somewhat slow result, and I know the table could DEFINITELY be structured better, but I have no control over that. Here is the query I would essentially have, but it doesn't seem to do what I want: it gives every record. And yes, [#Fruit] is a temp table.
SELECT * FROM [Results]
JOIN [#Fruit] ON
'%'+[Results].[Required]+'%' LIKE [#Fruit].[Name]
Ideally my output should be the following 3 rows:
I want one grape
I want one orange
I want one apple
If that kind of thing is doable, I would try it the other way round:
SELECT * FROM [Results]
JOIN [#Fruit] ON
[Results].[Required] LIKE '%'+[#Fruit].[Name]+'%'
This topic interests me, so I did a little bit of searching.
Suggestion 1: Full Text Search
I think what you are trying to do is Full Text Search.
You will need a full-text index created on the table if it is not already there (Create FULLTEXT Index).
This should be faster than performing LIKE; a rough sketch follows.
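As a very rough sketch (SQL Server syntax; the catalog name and key-index name are my inventions, and note that CONTAINS needs a literal or variable rather than a column, so you would loop or cursor over the fruit names):

-- one-time setup: catalog plus full-text index on the searched column
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON [Results]([Required]) KEY INDEX PK_Results;

-- then, for each fruit name:
DECLARE @name varchar(100) = 'grape';
SELECT * FROM [Results] WHERE CONTAINS([Required], @name);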
Suggestion 2: Metadata Search
Another approach I'd take is to create a metadata table and maintain the information myself whenever the [Results].[Required] values are updated (or created); see the sketch below.
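A minimal sketch, assuming [Results] has an [Id] primary key (the table and column names here are made up; a trigger or application code would keep the metadata in sync whenever [Required] changes):

CREATE TABLE [ResultFruit] (
    [ResultId]  INT NOT NULL,          -- points at Results.Id
    [FruitName] VARCHAR(100) NOT NULL, -- the fruit found in Results.Required
    PRIMARY KEY ([ResultId], [FruitName])
);

-- the search then becomes a plain indexed join:
SELECT r.* FROM [Results] r JOIN [ResultFruit] rf ON rf.[ResultId] = r.[Id];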
This looks more or less doable, but I'd start from the Fruit table just for conceptual clarity.
Here's roughly how I would structure this, ignoring all performance / speed / normalization issues (note also that I've switched around the variables in the LIKE comparison):
SELECT f.name, r.required
FROM fruits f
JOIN results r ON r.required LIKE CONCAT('%', f.name, '%')
...and perhaps add a LIMIT 10 to keep the query from wasting time while you're testing it out.
This structure will:
give you one record per "match" (per Result row that matches a Fruit)
exclude Result rows that don't have a Fruit
probably be ungodly slow.
Good luck!
I am using Solr 4.10.3.
I have various fields in the schema, e.g. id, title, content, type, etc. My scenario is that many docs share the same type value.
id | title | content | type
1 | pro | My | abc
2 | ver | name | ht
3 | art | is | abc
and so on.
When I query Solr, I want 10 results in total (the default), but at most two of them should have type:abc. The remaining 8 results can be of any type except abc, and may include several results of the same type.
Is there any possible solution?
Make two queries: one with rows=2 and type:abc, and a second with rows=8 and -type:abc. The rest of the query can be identical. Then combine the results before you show them to the users.
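For example (a sketch; the q placeholder stands for whatever your real query is, with the type restriction expressed as a filter query):

q=<your query>&fq=type:abc&rows=2
q=<your query>&fq=-type:abc&rows=8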
EDIT: After some research on what comes next in Solr features, I believe that combining the results will become possible once the streaming expressions are part of Solr (maybe in 5.2). See https://issues.apache.org/jira/browse/SOLR-7377
Starting with Solr 5.2 you can accomplish this using the Streaming Expressions (and the Streaming API under the covers). The expression would be
top(n=10,
    merge(
        top(n=2,
            search(fl="id, title, content, type", q="type:abc", sort="title asc")
        ),
        search(fl="id, title, content, type", q="-type:abc", sort="title asc"),
        on="title asc"
    )
)
What's going on here is that first you find at most two documents of type "abc", and then you find all documents that are not of type "abc". Notice that both streams are sorted by "title asc" (it's important that they be sorted by the same field(s)). The two streams are then merged on the "title" field in ascending order: the merge uses the title field to decide which document comes next in the combined stream, picking the first document, then the second, and so on. The outer top ensures that only 10 documents are returned, and because the "type:abc" stream was limited to at most 2 documents, the final merged stream will contain at most 2 "type:abc" documents.
Obviously you can change the sort criteria to whatever is best for you, but it is critical with a merge that the incoming streams are sorted in the same order. In fact, the Streaming API will validate that for you and reject the expression if that is not the case.
I am new to Solr; please help me with boosting fields.
I have a query like this:
q=name:test* OR description:test*
I want to apply a boost/weight of 500 to name and 50 to description.
For example:
Let's say the term "test" appears once in the name field of one record, and 20 times in the description field of another record. The boost calculation should then work like this:
for name: 1 x 500 = 500
for description: 20 x 50 = 1000
As a result, the records with the higher boost value should come out on top.
So, based on the above calculation, the record with 20 matches in the description field should come first, followed by the record with 1 match in the name field.
If anyone has a solution for this, please share it.
Thanks in advance.
You can boost a field at index time with the boost attribute, or you can apply a boost in the query, such as q=name:test*^50 OR description:test* (and there are some more advanced features here as well).
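With your numbers, a query-time version might look like this (just a sketch; as noted below, the raw multipliers interact with the rest of Lucene's scoring, so don't expect the literal 500/1000 arithmetic):

q=name:test*^500 OR description:test*^50

If you use the edismax parser, the qf parameter expresses the same per-field weighting, e.g. defType=edismax&qf=name^500 description^50.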
It bears noting, though, that Lucene by default applies a length normalization that effectively weighs matches in shorter fields more heavily than those in longer fields. It sounds a bit like that is what you are trying to recreate.
If you need the scoring calculation to be as simple as what you have provided, you would need to write your own Similarity class, I believe.
I'm looking for an efficient way of storing sets of objects that have occurred together during events, in such a way that I can generate aggregate stats on them on a day-by-day basis.
To make up an example, let's imagine a system that keeps track of meetings in an office. For every meeting we record how many minutes long it was and in which room it took place.
I want to get stats broken down both by person as well as by room. I do not need to keep track of the individual meetings (so no meeting_id or anything like that), all I want to know is daily aggregate information. In my real application there are hundreds of thousands of events per day so storing each one individually is not feasible.
I'd like to be able to answer questions like:
In 2012, how many minutes did Bob, Sam, and Julie spend in each conference room (not necessarily together)?
Probably fine to do this with 3 queries:
>>> query(dates=2012, people=[Bob])
{Board-Room: 35, Auditorium: 279}
>>> query(dates=2012, people=[Sam])
{Board-Room: 790, Auditorium: 277, Broom-Closet: 71}
>>> query(dates=2012, people=[Julie])
{Board-Room: 190, Broom-Closet: 55}
In 2012, how many minutes did Sam and Julie spend MEETING TOGETHER in each conference room? What about Bob, Sam, and Julie all together?
>>> query(dates=2012, people=[Sam, Julie])
{Board-Room: 128, Broom-Closet: 55}
>>> query(dates=2012, people=[Bob, Sam, Julie])
{Board-Room: 22}
In 2012, how many minutes did each person spend in the Board-Room?
>>> query(dates=2012, rooms=[Board-Room])
{Bob: 35, Sam: 790, Julie: 190}
In 2012, how many minutes was the Board-Room in use?
This is actually pretty difficult since the naive strategy of summing up the number of minutes each person spent will result in serious over-counting. But we can probably solve this by storing the number separately as the meta-person Anyone:
>>> query(dates=2012, rooms=[Board-Room], people=[Anyone])
865
What are some good data structures or databases that I can use to enable this kind of querying? Since the rest of my application uses MySQL, I'm tempted to define a string column that holds the (sorted) ids of each person in the meeting, but the size of this table will grow pretty quickly:
2012-01-01 | "Bob" | "Board-Room" | 2
2012-01-01 | "Julie" | "Board-Room" | 4
2012-01-01 | "Sam" | "Board-Room" | 6
2012-01-01 | "Bob,Julie" | "Board-Room" | 2
2012-01-01 | "Bob,Sam" | "Board-Room" | 2
2012-01-01 | "Julie,Sam" | "Board-Room" | 3
2012-01-01 | "Bob,Julie,Sam" | "Board-Room" | 2
2012-01-01 | "Anyone" | "Board-Room" | 7
What else can I do?
Your question is a little unclear: you say you don't want to store each individual meeting, but then how are you getting the current meeting stats (dates) in the first place? In addition, any table with the right indexes can be very fast, even with a lot of records.
You should be able to use a table like log_meeting. I imagine it could contain something like:
employee_id, room_id, date (as timestamp), time_in_meeting
where employee_id is a foreign key to the employee table and room_id is a foreign key to the room table.
If you index (employee_id, room_id, date) you should get pretty quick lookups, since MySQL multiple-column indexes work left to right: you effectively gain an index on (employee_id), on (employee_id, room_id), and on (employee_id, room_id, date). This is explained further in the multiple-index part of:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
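A minimal sketch of that table in MySQL (names and types are my guesses, not a prescription):

CREATE TABLE log_meeting (
    employee_id     INT NOT NULL,        -- FK to the employee table
    room_id         INT NOT NULL,        -- FK to the room table
    meeting_ts      TIMESTAMP NOT NULL,  -- when the meeting took place
    time_in_meeting INT NOT NULL,        -- minutes spent in the meeting
    KEY idx_emp_room_ts (employee_id, room_id, meeting_ts)
);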
By refusing to store meetings (and related objects) individually, you are losing the original source of information.
You will not be able to compensate for this loss of data unless you materialize, on a regular basis, the extensive list of all potential daily (or weekly or monthly or ...) aggregates that you might need to query later on!
Believe me, it's going to be a nightmare ...
If the number of people is constant and not very large, you can assign a column to each person (present or not) and store the room, date, and time in three more columns; this removes the string-splitting problems.
Also, by the nature of your question, I feel you should first assign ids to everything: rooms, people, etc. There is no need for long repetitive strings in the DB. Try to reduce string operations and work with individual values in each column for better intersection performance. You could also store every permutation of people in a table, assign an id to each, and then use that id in the actual date-and-time table. But all of these techniques require that something be constant, either the people or the rooms.
I do not understand whether you know all the "questions" at design time or whether it's possible to add new ones during development/production; the latter would require keeping all the data all the time.
If you do know all your questions up front, this seems like a classic "banking system" that recalculates its data on a daily basis.
Here is how I think about it.
It seems like you have a limited number of rooms, people, days, etc.
Gather logging data on a daily basis, one table per day: one event per database row, with every field you need.
Start analysing the data with some cron script at "midnight".
Update the stats for people, rooms, etc.: just increment the number of hours spent by Bob in room xyz, and so on, for everything your requirements need.
Since the analysed (compressed) data are limited and relatively small, your system can also support various queries, as the indexes will stay relatively small.
You might also be able to use a scalable map/reduce algorithm.
You can't avoid storing the atomic facts: (the meeting room, the people, the duration, the day). That is probably only a weak consolidation anyway, since it only saves space when the same people meet multiple times in the same room on the same day. Maybe that happens a lot in your office :).
Making groups comparable is an interesting problem, but as long as you always compose the member strings the same way, you can probably do it with string comparisons. This is not "normal", however. To normalise it you'll need a many-to-many relation table, and then either compose a temporary table out of your query set so it joins quickly, or use an "IN" clause with a count aggregate to ensure everyone is there (you'll see what I mean when you try it); a sketch of the latter follows.
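A hedged sketch of the IN-clause variant (all names invented; meeting_attendee is the many-to-many relation between meetings and people):

SELECT ma.meeting_id
FROM meeting_attendee ma
WHERE ma.person_id IN (2, 3)                 -- e.g. Sam and Julie, by id
GROUP BY ma.meeting_id
HAVING COUNT(DISTINCT ma.person_id) = 2;     -- both must be present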
I think you can derive the minutes the board room was in use as meetings shouldn't overlap, so a sum will work.
For storage efficiency, use integer keys for everything with lookup tables. Dereference the integers during the query parsing, or just use good old joins if you are feeling traditional.
That's how I would do it anyway :).
You'll probably have to store individual meetings to get the data you need anyway.
However you'll have to make sure you aggregate and anonymise it properly before creating your reports. Make sure to separate concerns and access levels to stay within the proper legal limits on data.
I have a pair of multi-valued index fields, author and author_norm, and I've created a hierarchical facet field for them using the pattern described at https://wiki.apache.org/solr/HierarchicalFaceting#Indexed_Terms. The facet values look like this:
0/Blow, J
1/Blow, J/Blow, Joe
1/Blow, J/Blow, Joseph
1/Blow, J/Blow, Jennifer
0/Smith, M
1/Smith, M/Smith, Michelle
1/Smith, M/Smith, Michael
1/Smith, M/Smith, Mike
Authors are associated to article records, and in most cases an article will have many authors. This means that for a Solr query that returns 100+ articles, there will potentially be 1000+ authors represented.
My problem is that when I go to display this hierarchy to the user, because my facet.limit and facet.mincount are set to sane values, I do not have the complete set of 2nd-level values; i.e., the 2nd level of my hierarchy will be cut off at a certain point. I will have something like this:
Blow, J (30)
Blow, Joe (17)
Blow, Joseph (9)
Smith, M (22)
Smith, Michelle (14)
Smith, Michael (6)
I would like to also have the "Blow, Jennifer (4)" and "Smith, Mike (2)" entries in this list, but they're not returned in the response because mincount cutoff is 5. So I end up with a confusing display (17 + 9 != 30, etc).
One option would be to put a little "(more)" link at the bottom of every 2nd-level list and fetch the full set via ajax. I'm not crazy about this solution because it's asking users to work/click more than they really should have to, and also because I can't control the length of the initial 2nd-level listing; sometimes it'll be 3 names + "(more)", sometimes 2 or even 1. That's just ugly.
I could set mincount=1 and limit=-1 for just my hierarchical facet field, but that would be nuts because for a large query (100k hits) I would be fetching 100k+ values I don't need. I only need the full set of 2nd-level values for the top N 1st-level values.
So unless someone has a better suggestion, I'm assuming I need to do some kind of follow-up query. So after all that, here's what I'm really asking: is there a way to fetch these 2nd-level values in a single follow-up query? Given an initial Solr response, how can I get all the 2nd-level permutations of just the top N 1st-level values of my hierarchy?
Thanks!
PS, I'm using Solr 4.0.
You can modify the mincount and limit for any level in a pivot:
facet.pivot=fieldA,fieldB&f.fieldA.facet.limit=3&f.fieldB.facet.limit=-1
Problems start when both fields are the same (facet.pivot=fieldA,fieldA); in that case I would probably create a copy of fieldA as fieldB.
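Applied to the hierarchy above, a sketch might look like this (author_l0 and author_l1 are assumed copies of the 1st- and 2nd-level values into separate fields, and the per-field overrides lift the mincount/limit only for the 2nd level):

facet.pivot=author_l0,author_l1
&f.author_l0.facet.limit=10
&f.author_l1.facet.limit=-1
&f.author_l1.facet.mincount=1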