Multi term search in SOLR (WebSphere Commerce7) - solr

My search for "Electric blanket" returns every product matching "electric", "blanket", or "electric blanket". However, I need only the results specific to "electric blanket".
The query my application sends to SOLR contains:
q="electric" "blanket"
What change is required in the SOLR configuration to make this search match only electric blanket?

In your schema.xml, add to the end of the file, before </schema>:
<solrQueryParser defaultOperator="AND"/>
Solr Documentation: https://wiki.apache.org/solr/SchemaXml#Default_query_parser_operator
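If editing schema.xml is not practical, the standard query parser also accepts a per-request q.op parameter with the same effect. A minimal sketch in Python, assuming a hypothetical core URL (SOLR_SELECT_URL) and a hypothetical helper name; in WebSphere Commerce the request is built by the storefront code, so this only illustrates the parameters involved:

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/catalog/select"


def build_and_query(terms):
    """Build a select URL in which every quoted term must match."""
    params = {
        # Produces: "electric" "blanket"
        "q": " ".join(f'"{term}"' for term in terms),
        # Per-request counterpart of defaultOperator="AND"
        "q.op": "AND",
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_and_query(["electric", "blanket"])
print(url)
```

With q.op=AND, a document has to contain both terms; it does not have to contain them as an adjacent phrase.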

In SearchSetup.jspf, change the default value of searchType from 1000 to 1001:
<c:set var="searchType" value="1001" scope="request"/>
The explanation is below; the complete list is in the same file:
13. ANY   | 1000      | INCLUDE products, kits, bundles, category level SKUs
          | (Default) | EXCLUDE product level SKUs
          |           |
14. EXACT | 1001      | INCLUDE products, kits, bundles, category level SKUs
          |           | EXCLUDE product level SKUs
          |           |
15. ALL   | 1002      | INCLUDE products, kits, bundles, category level SKUs
          |           | EXCLUDE product level SKUs
EXACT forces Solr to match the whole phrase, the same as typing your search term in quotes ("search term"). It searches for products that have the exact phrase in the product name, short description, category name, SEO keywords, name override, or description override.
Hope this answers your question.
Thanks
Abed

You cannot get results specific only to electric blanket. You can, however, make the results specific to electric blanket more relevant so that they are returned at the top of your results.
The Solr configuration that WebSphere Commerce offers out of the box is very basic. Getting the desired results takes a lot of tuning and configuration changes in both Solr and WebSphere Commerce (we resorted to writing a custom class). sarona.co.uk has some nice information about this in their latest blogs.

Related

Frequency of search term using solr

I have a secondary index that stores the search terms executed on a primary index used for searching documents. I want to run a search on the secondary index and list the search terms in descending order of execution frequency, for example to find the top 10 most searched terms.
The secondary index stores data in this format
Search Term | Date ...<some more irrelevant fields>
term1 | 01-01-2018
term2 | 01-01-2018
term3 | 02-01-2018
term1 | 02-01-2018
term3 | 03-01-2018
I need something like this that I can manipulate in Java, so any JSON from Solr with the search term and frequency is okay.
Search Term, Frequency
term1, 2
term2, 1
term3, 2
I have looked at some articles that suggest the Term Vector Component, but those articles count the number of times a specific term occurs in a document.
Can someone help me get the desired result?
Thanks
You can use faceting to tally how often a given token appears in a field.
&facet=true&facet.field=term&facet.sort=count
Many other facet parameters are available, for example facet.sort to order by term (index) or by count.
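In the JSON response, Solr returns each facet field as a flat list that alternates value and count. A minimal sketch in Python of turning that list into (term, frequency) pairs, assuming the field is named term as in the query above and is indexed as an untokenized string, so that whole search phrases are counted rather than individual words. The sample response is hand-written to match the data in the question:

```python
def facet_counts_to_pairs(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    pairs = list(zip(flat[0::2], flat[1::2]))
    # facet.sort=count already orders by count; sorting again keeps the
    # helper correct if the request used a different facet.sort.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hand-written sample shaped like a Solr JSON response (not real output).
sample_response = {
    "facet_counts": {
        "facet_fields": {"term": ["term1", 2, "term3", 2, "term2", 1]}
    }
}

top_terms = facet_counts_to_pairs(sample_response, "term")[:10]
print(top_terms)
```

Adding facet.limit=10 to the request would let Solr do the top-10 cut itself.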

Relevance and Solr Grouping

Say I have the following collection of webpages in a Solr index:
+-----+----------+----------------+--------------+
| ID | Domain | Path | Content |
+-----+----------+----------------+--------------+
| 1 | 1.com | /hello1.html | Hello dude |
| 2 | 1.com | /hello2.html | Hello man |
| 3 | 1.com | /hello3.html | Hello fella |
| 4 | 2.com | /hello1.html | Hello sir |
...
And I want a query for hello to show results grouped by domain like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is ordering determined if I sort by score? I normally use a combination of TF/IDF and PageRank for my results, but since that calculates a score for each individual item, how does it determine how to order the groups? What if 1.com/hello3.html and 1.com/hello2.html have very low relevance but count as two results, while 2.com/hello1.html has really high relevance and is only one result? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.
It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
If this isn't what you want, your best options are to provide a different sort parameter or result post-processing. For example, for one project I was involved in, (where we had a very modest number of groups,) we chose to pull the top three results for each group and in post processing we calculated our own sort order for the groups based on the combination of their scores and numFound values. This sort of strategy might have been prohibitive for cases with too many groups, and may not be a good idea if the more numerous groups run the risk of making the most relevant documents harder to find.
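The post-processing idea can be sketched as follows in Python. The weighting formula (top score scaled by the logarithm of numFound) and the count_weight parameter are assumptions made for illustration, not the formula used in that project; the input mirrors the groups list of a grouped Solr response requested with fl=score:

```python
import math


def rerank_groups(groups, count_weight=0.1):
    """Order groups by top document score, nudged upward by group size."""

    def group_key(group):
        doclist = group["doclist"]
        top_score = max(doc["score"] for doc in doclist["docs"])
        # Hypothetical combination: log() damps the effect of large groups.
        return top_score * (1.0 + count_weight * math.log(doclist["numFound"]))

    return sorted(groups, key=group_key, reverse=True)


# Hand-written sample based on the question's data (scores are invented).
groups = [
    {"groupValue": "1.com", "doclist": {"numFound": 3, "docs": [
        {"path": "/hello1.html", "score": 0.4},
        {"path": "/hello2.html", "score": 0.3},
        {"path": "/hello3.html", "score": 0.2},
    ]}},
    {"groupValue": "2.com", "doclist": {"numFound": 1, "docs": [
        {"path": "/hello1.html", "score": 0.9},
    ]}},
]

ranked = [g["groupValue"] for g in rerank_groups(groups)]
print(ranked)
```

A small count_weight keeps the behaviour close to Solr's default (best document wins); a large one lets populous groups overtake a single highly relevant hit.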

Cassandra/Solr data model improvement

I have the following table:
CREATE TABLE videos_tags (
    id text,
    tag text,
    video text,
    someotherfield bigint,
    PRIMARY KEY (id)
) WITH gc_grace_seconds = 1296000
AND compaction={'class': 'LeveledCompactionStrategy'}
AND compression={'sstable_compression': 'LZ4Compressor'};
The table stores a list of tags and videos. A video can have one or more tags; and a tag can be attributed to more than one video. Example:
id | tag | video
------------------------------------------
1 | dancing | video1
2 | singing | video2
3 | prank | video3
4 | prank | video4
5 | funny | video3
6 | cover | video2
I want to show my users a list of related videos based on tag assignment: the more tags a video has in common with the user's video, the more "related" it is. My current approach consists of two steps:
Get the list of the user's video's tags:
q=*:*&fq=video:video1&fl=tag
Identify the videos that use the same tags as the user's video and select the top 10 (result set slicing is done on the application side):
q=*:*&fq=tag:tag1 AND tag:tag2 AND tag:tag3 AND !video:video1&fl=video&stats=true&stats.field=someotherfield&stats.facet=video
Note: I used stats instead of plain facet because I also need the sum of someotherfield
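To make the intended ranking concrete, here is a minimal in-memory sketch in Python of "count shared tags, keep the top 10". It only illustrates the logic on the sample rows above; the function name and row shape are hypothetical, and it says nothing about how to make the Solr queries faster at 1.5 billion rows:

```python
from collections import Counter


def related_videos(rows, user_video, top_n=10):
    """Rank other videos by how many tags they share with user_video."""
    user_tags = {row["tag"] for row in rows if row["video"] == user_video}
    shared = Counter(
        row["video"]
        for row in rows
        if row["tag"] in user_tags and row["video"] != user_video
    )
    return shared.most_common(top_n)


# The sample rows from the table above.
rows = [
    {"id": "1", "tag": "dancing", "video": "video1"},
    {"id": "2", "tag": "singing", "video": "video2"},
    {"id": "3", "tag": "prank", "video": "video3"},
    {"id": "4", "tag": "prank", "video": "video4"},
    {"id": "5", "tag": "funny", "video": "video3"},
    {"id": "6", "tag": "cover", "video": "video2"},
]

related = related_videos(rows, "video3")
print(related)
```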
This approach yields an average execution time of 30 seconds. Unfortunately, the maximum acceptable query time for my app is 10 seconds.
Is there a better approach to tackling this data requirement? I'm open to:
Alternative query approach (minor tweaks are preferred; but I can accept something as drastic as replacing my 2-step approach completely)
Alternative schema
Notes:
The actual schema has several other fields that I removed from this post for brevity
I do all read operations via Solr (Datastax Enterprise 4.6.0). Nothing fancy in the Solr schema
The table currently holds 1.5 billion rows, but could grow to double or triple of that within years (so the solution must take into account the table/index size)
No fulltext search - only exact string filters

SOLR - Grouping results with group.limit return wrong numFound

When I run a search with result grouping and set group.limit, numFound is the same as when I don't use the limit.
It looks like SOLR first performs the search and calculates numFound, and only then limits the results.
I can't use pagination and other stuff.
Is there any workaround, or did I miss something?
Example:
======================================
| id | publisher | book_title |
======================================
| 1 | A1 | Title Book |
| 2 | A1 | Book title 123 |
| 3 | A1 | My book |
| 4 | B2 | Hi book title |
| 5 | B2 | Another Book |
If I perform query:
q=book_title:book
&group=true
&group.field=publisher
&group.limit=1
&group.main=true
I will get numFound 5 but only 2 in the results.
"response": {
  "numFound": 5,
  "docs": [
    {
      "book_title": "My book",
      "publisher": "A1"
    },
    {
      "book_title": "Another Book",
      "publisher": "B2"
    }
  ]
}
Set group.ngroups to true.
That will produce
"grouped": {
  "bl_version_id": {
    "matches": 53,
    "ngroups": 18,
    "groups": [
      {
        ...
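Reading the group count back out of such a response is a plain dictionary lookup. A minimal sketch in Python, assuming the response uses the default grouped format (the flattened group.main=true format appears to have no slot for ngroups); the sample dictionary is hand-written from the numbers above, with an empty groups list standing in for the elided part:

```python
def group_count(response, field):
    """Return the number of distinct groups reported by group.ngroups=true."""
    return response["grouped"][field]["ngroups"]


# Hand-written sample; "groups" is left empty because the original is elided.
sample_response = {
    "grouped": {
        "bl_version_id": {"matches": 53, "ngroups": 18, "groups": []}
    }
}

n_groups = group_count(sample_response, "bl_version_id")
print(n_groups)
```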
I had the same problem, couldn't find a way to fix the root cause, but I will share my solution as a workaround.
What I did is
Facet by the field I'm grouping on.
Count the number of unique facets. This will match the number of unique documents (2 in your case)
Add these faceting parameters to your query:
&facet=true
&facet.limit=-1
&facet.field=publisher
Notes:
This is a bit expensive, but it's the only way that worked for me (so far).
This will only work if publisher is not multi-valued
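The counting step can be sketched like this in Python. By default Solr also lists facet values whose count is 0 for the current query, so the sketch skips those (sending facet.mincount=1 with the request would have the same effect). The sample response is hand-written from the question's data, with C3 added as a hypothetical publisher that has no matching books:

```python
def count_groups_from_facets(response, field):
    """Count facet values that have at least one matching document."""
    flat = response["facet_counts"]["facet_fields"][field]
    counts = flat[1::2]  # the list alternates value, count, value, count, ...
    return sum(1 for count in counts if count > 0)


# Hand-written sample; "C3" is a made-up publisher with zero matches.
sample_response = {
    "facet_counts": {
        "facet_fields": {"publisher": ["A1", 3, "B2", 2, "C3", 0]}
    }
}

unique_publishers = count_groups_from_facets(sample_response, "publisher")
print(unique_publishers)
```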
numFound indicates the total number of documents matched by the current query, so 5 is correct in your case. With group.limit=1, Solr returns at most one document per group, even if many documents reside in that group.
I suggest using group.limit=-1 in your query; it will return all 5 documents in the result.
For more information please check details given below.
solr fieldcollapsing and maximum group.limit
http://wiki.apache.org/solr/FieldCollapsing
group.limit isn't a real limit; it's only the number of rows to return per group.
There is no easy solution implemented in Solr for my problem.
You may find an answer here:
Solr User Group
numFound refers to the total number of documents Solr found when executing your query, which is also what you need in order to paginate that query.
Pagination in Solr works much as it does with a regular RDBMS: you use the start and rows parameters. For instance, the following query fetches 10 documents starting at offset 20:
?q=you_key_word&start=20&rows=10
This query fetches the content for the target page (page number 3 in this case, assuming 10 docs per page). Instead of executing another query to get the total number of documents and work out the number of pages, you get this information automatically as the value of numFound.
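The arithmetic behind start, rows, and numFound can be sketched in Python as follows. The helper names are hypothetical, and pages are assumed to be numbered from 1:

```python
import math


def page_params(page, rows=10):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"start": (page - 1) * rows, "rows": rows}


def total_pages(num_found, rows=10):
    """Number of pages needed to show num_found documents."""
    return math.ceil(num_found / rows)


params = page_params(3)   # the example above: start=20, rows=10
pages = total_pages(53)   # 53 matches at 10 per page
print(params, pages)
```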
Hope this helps

(Full-Text) Search And Database Design

This is a system architecture question on designing full-text search with (relational) database. The specific software I'm using are Solr and PostgreSQL, just FYI.
Suppose we are building a forum with two users Andy and Betty --
Post ID | User | Title | Content
--------|-------|-------------------|---------------------------
1 | Andy | Dark Knight rocks | Dark Knight rocks blah
2 | Betty | I love Twilight | Twilight blah blah
3 | Andy | Twilight sucks | Twilight sucks blah
4 | Betty | Andy sucks | Twilight rocks, Andy sucks
When the posts table is indexed in Solr, we can easily return the posts sorted by their relevancy to "?q=twilight" or "?q=dark+night".
Now we want to add a new feature to search for users instead of posts. A naive implementation would simply index the user name and return "Andy" for "?q=a" and "Betty" for "?q=b". But what if we want to make our system smarter, so that it also takes the users' posts into account and returns "Betty" before "Andy" for "?q=twilight", because Betty mentions Twilight more than Andy does?
How would you design the system to efficiently handle the user-search function for hundreds of thousands of users and millions of posts?
Faceting on User would return number of results per user. If Andy wrote 15 posts that match Twilight while Betty wrote 10, the faceting will return them as such.
But it won't help if both wrote 15 posts about Twilight and Andy's were supposed to be more relevant; you will see all facet counts (15, 15 in this case) even if you are paginating to see only (say) the top 5 results and Andy made 4 of them.
If the above solution is not good enough, consider a background job that, once a week, writes documents of the form
type: suggest_user_type (so you can distinguish them by a `fq`)
user: Andy (the user)
concatted_posts: "I think Twilight.." (concatenate the user's latest 50 posts)
Then, if you query with
fq=type:suggest_user_type&
q=concatted_posts:twilight&
fl=user
you get a sorted list of users based on relevance of concatted_posts with respect to twilight.
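The document-building part of such a background job might look like the following Python sketch. The function name, the id scheme, and the assumption that a higher post ID means a newer post are all hypothetical; fetching posts from PostgreSQL and sending the documents to Solr are left out:

```python
from collections import defaultdict


def build_user_docs(posts, max_posts=50):
    """Build one Solr document per user from that user's latest posts."""
    contents_by_user = defaultdict(list)
    # Assumption: higher post ID means newer post.
    for post in sorted(posts, key=lambda p: p["id"], reverse=True):
        contents = contents_by_user[post["user"]]
        if len(contents) < max_posts:
            contents.append(post["content"])
    return [
        {
            "id": f"suggest_user_{user}",  # hypothetical id scheme
            "type": "suggest_user_type",
            "user": user,
            "concatted_posts": " ".join(contents),
        }
        for user, contents in contents_by_user.items()
    ]


# The forum posts from the question.
posts = [
    {"id": 1, "user": "Andy", "content": "Dark Knight rocks blah"},
    {"id": 2, "user": "Betty", "content": "Twilight blah blah"},
    {"id": 3, "user": "Andy", "content": "Twilight sucks blah"},
    {"id": 4, "user": "Betty", "content": "Twilight rocks, Andy sucks"},
]

docs = {doc["user"]: doc for doc in build_user_docs(posts)}
print(docs["Betty"]["concatted_posts"])
```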
I believe term frequency is included in full text search ranking. It's part of a research area called information retrieval. There's also another value called the inverse document frequency, which filters out common terms.
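As a rough illustration of how term frequency and inverse document frequency interact, here is the textbook form of TF-IDF in Python, applied to the post contents from the question. This is a simplified sketch; the scoring formula Solr actually uses differs in its details:

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Post contents from the question, lowercased and split into words.
corpus = [
    "dark knight rocks blah".split(),
    "twilight blah blah".split(),
    "twilight sucks blah".split(),
    "twilight rocks andy sucks".split(),
]

rare = tf_idf("knight", corpus[0], corpus)    # appears in 1 of 4 posts
common = tf_idf("blah", corpus[1], corpus)    # appears in 3 of 4 posts
print(rare, common)
```

Even though "blah" occurs twice in the second post, the rarer word "knight" ends up with the higher weight, which is the down-weighting of common terms described above.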
There are other steps common to ranking text, you may want to have a look at the OpenNLP project if you're interested.
In terms of database design, there's too much to cover in a post and I'm not the one to write it. The general consensus seems to be that for very large systems the key is building an efficient index, then distributing it over a number of machines to scale performance. I would recommend reading up on PageRank and how Google developed its systems as a starting point.

Resources