I have a secondary index which stores the search terms that were executed against a primary index used for searching documents. I want to run a search on the secondary index and list the search terms in descending order of execution frequency, e.g. to find the top 10 most searched terms.
The secondary index stores data in this format
Search Term | Date ...<some more irrelevant fields>
term1 | 01-01-2018
term2 | 01-01-2018
term3 | 02-01-2018
term1 | 02-01-2018
term3 | 03-01-2018
I need something like the following, which I can manipulate in Java; any JSON from Solr containing the search term and its frequency is fine.
Search Term, Frequency
term1, 2
term2, 1
term3, 2
I have looked at some articles that suggest using the Term Vector Component, but those articles count how many times a specific term occurs within a document.
Can someone help me get the desired result?
Thanks
You can use faceting to tally how often a given token appears in a field.
&facet=true&facet.field=term&facet.sort=count
There are also many other parameters you can set, for example to order by term instead of by count.
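For example, assuming the search-term field in your secondary index is called term, a request along these lines

/select?q=*:*&rows=0&wt=json&facet=true&facet.field=term&facet.sort=count&facet.limit=10

returns the counts in the facet_counts section of the JSON response, with values and counts interleaved in a flat list:

"facet_counts": {
  "facet_fields": {
    "term": ["term1", 2, "term3", 2, "term2", 1]
  }
}

That is straightforward to parse from Java, or you can read it through SolrJ's QueryResponse.getFacetField("term").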
I am using Apache Solr 8 with products as documents. Each document includes a sales count for the last X days that I want to use for boosting, as well as a title and other fields.
Say productA has been sold 5 times; I want to boost its score by 10. If productB has been sold 50 times, I want to boost its score by 30.
I tried to use a boost function (edismax query parser) that looks like this:
q=Coffee&qf=title&bf=if(lt(sales,5),10,if(lt(sales,50),30))
Solr now returns documents that have nothing to do with my "Coffee" query but only match the boost function. There are even results with a score of 0.
E.g.
Rank;Score;Sales;Title
1;58.53;55;Coffee big
2;38.11;50;Coffee
3;30;55;Tea
Any idea how to get rid of those "boost function only" matches?
Found the answer!
My query fields (qf) actually included boosts like
&qf=title^2 longDescription^0 whatever^0...
Instead of excluding the results found in those 0-boosted fields, Solr includes them as matches with, well, a score of 0.
When I remove the 0-boosts, everything works as intended.
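In other words, something like &qf=title^2 longDescription (with the ^0 entries simply removed; field names are just from my schema) behaves as intended.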
I am currently using Azure Search to perform product searches on my website.
I have the following indexes:
A: Index with 55,000 documents
B: Index with 16 documents
All documents in index B were taken from index A.
When performing a simple search on the two indexes with the same parameters, the results are not what I expect.
Example:
Index A:
Query String: search=kfc
Result sorted by search.score descending:
ProductoName - search.score
KFC Product1 - 1.6514521
KFC Product2 - 1.5482594
Index B:
Query String: search=kfc
Result sorted by search.score descending:
ProductoName - search.score
KFC Product2 - 0.21555252
KFC Product1 - 0.13616839
I am surprised that the order of the results by search score changes, because the data is exactly the same; only the number of documents differs.
Does the number of documents affect how the search score is assigned? Could you point me to where I can read about it? I looked in the documentation but did not find anything about this.
Could you explain to me why the order of the products is affected if it is the same information? :(
Neither index has a scoring profile, and the information is exactly the same.
Your analysis is correct: scoring (and thus ranking) is indeed affected by the number of documents in the index. To compute scores we use statistical characteristics of the data corpus, such as the frequency of each term across the entire corpus and within each document.
The article How full text search works in Azure Search explains this in great detail. In particular, the section on Scoring goes into how frequencies (term frequency, document frequency) are used.
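As a rough back-of-the-envelope illustration (this is just the classic TF-IDF shape, not the exact formula the engine uses): idf(t) = log(N / df(t)), where N is the number of documents in the index and df(t) is how many of them contain the term. With N = 55,000, a term found in, say, 10 documents gets idf ≈ log(5500) ≈ 8.6, while in the 16-document index the same term found in 10 documents gets idf ≈ log(1.6) ≈ 0.5. Identical documents can therefore receive very different scores, and once several terms interact their relative order can change as well.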
I am using Solr 4.10.3.
I have various fields in the schema, e.g. id, title, content, type, etc. Many of my documents share the same type value.
id | title | content | type
1 | pro | My | abc
2 | ver | name | ht
3 | art | is | abc
and so on.
When I query Solr, I want 10 results in total (the default), but among them at most two with type:abc. The remaining 8 results can be of any type except abc, and more than one of them can share a type.
Is there any possible solution?
Make two queries: one with rows=2 and type:abc, and a second with rows=8 and -type:abc. The rest of the query can be identical. Then combine the results before you show them to users.
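For example, keeping the rest of the query identical and doing the type restriction with a filter query (just one way to write it):

...&fq=type:abc&rows=2
...&fq=-type:abc&rows=8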
EDIT: After some research on what comes next in Solr features, I believe that combining the results will become possible once the streaming expressions are part of Solr (maybe in 5.2). See https://issues.apache.org/jira/browse/SOLR-7377
Starting with Solr 5.2 you can accomplish this using the Streaming Expressions (and the Streaming API under the covers). The expression would be
top(n=10,
  merge(
    top(n=2,
      search(fl="id, title, content, type", q="type:abc", sort="title asc")
    ),
    search(fl="id, title, content, type", q="-type:abc", sort="title asc"),
    on="title asc"
  )
)
What's going on here is that first you're finding at most two documents of type "abc", then you're finding all documents that are not of type "abc". Notice that both streams are sorted by "title asc" (it's important that they are sorted by the same field(s)). We then merge the two streams on the "title" field in ascending order: the merge uses the title field to decide which document comes next in the combined stream, picking the first document, then the second, then the third, and so on. The outer top ensures that only 10 documents are returned, and because the "type:abc" stream was limited to at most 2 documents, the final merged stream will contain at most 2 "type:abc" documents.
Obviously you can change the sort criteria to whatever is best for you but it is critical with a merge that the incoming streams are sorted in the same order. And in fact, the Streaming API will validate that for you and reject the expression if that is not the case.
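If it helps, an expression like this is sent to the collection's /stream handler, e.g. something along the lines of

curl --data-urlencode 'expr=top(n=10, merge(...))' http://localhost:8983/solr/yourCollection/stream

where yourCollection and the host/port are placeholders for your own setup.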
Say I have the following collection of webpages in a Solr index:
+-----+----------+----------------+--------------+
| ID | Domain | Path | Content |
+-----+----------+----------------+--------------+
| 1 | 1.com | /hello1.html | Hello dude |
| 2 | 1.com | /hello2.html | Hello man |
| 3 | 1.com | /hello3.html | Hello fella |
| 4 | 2.com | /hello1.html | Hello sir |
...
And I want a query for hello to show results grouped by domain like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is ordering determined if I sort by score? I normally use a combination of TF/IDF and PageRank for my results, but since that calculates a score for each individual item, how does it determine how to order the groups? What if 1.com/hello3.html and 1.com/hello2.html have very low relevance but two results, while 2.com/hello1.html has really high relevance and only one result? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.
It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
If this isn't what you want, your best options are to provide a different sort parameter or result post-processing. For example, for one project I was involved in, (where we had a very modest number of groups,) we chose to pull the top three results for each group and in post processing we calculated our own sort order for the groups based on the combination of their scores and numFound values. This sort of strategy might have been prohibitive for cases with too many groups, and may not be a good idea if the more numerous groups run the risk of making the most relevant documents harder to find.
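For reference, with the example fields above, such a grouped query would look something like

q=hello&group=true&group.field=domain&group.limit=3&sort=score desc&group.sort=score desc

where group.limit controls how many documents are returned per group.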
I'm trying to write code for VSM search in C. Using a collection of documents, I built a hash table (inverted index) in which each slot holds a word along with its df and a pointer to a list; each node of that list holds the name of a document (in which the word appeared at least once) along with the tf (how many times the word appeared in that document). The user types a query (and also chooses the qqq.ddd weighting and the comparison method, but that doesn't matter for my question) and I have to print the relevant documents, from most relevant to least relevant. The examples I've seen show the steps using only one document. For example: we have a collection of 1,000,000 documents (N = 1,000,000) and we want to compare
1 document: car insurance auto insurance
with the query: best car insurance
So in the example it creates an array like this:
Term      | Query tf | Document tf
auto      |    0     |     1
best      |    1     |     0
car       |    1     |     1
insurance |    1     |     2
The example also gives the df for each term, so with this information and the weighting and comparison methods it is easy to compare the two by turning them into vectors, i.e. by computing the 4 coordinates (one for each word in the array).
So in this example there are 1,000,000 documents, and to see how relevant the document is to the query we use each word that occurs in the query or in the document exactly once (4 words). So we have to find 4 coordinates and then compare.
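For instance, a minimal C sketch of that comparison for the table above, using only the raw tf values (the idf and the qqq.ddd weighting/normalization are left out for brevity), might look like this:

#include <math.h>
#include <stdio.h>

/* Raw tf vectors over the 4 terms of the example:
   index 0 = auto, 1 = best, 2 = car, 3 = insurance */
int main(void) {
    double q[4] = {0, 1, 1, 1};  /* query:    best car insurance           */
    double d[4] = {1, 0, 1, 2};  /* document: car insurance auto insurance */
    double dot = 0.0, qlen = 0.0, dlen = 0.0;
    int i;

    for (i = 0; i < 4; i++) {
        dot  += q[i] * d[i];     /* numerator of the cosine similarity */
        qlen += q[i] * q[i];
        dlen += d[i] * d[i];
    }
    /* cosine = (q . d) / (|q| * |d|)  ->  about 0.71 for this example */
    printf("cosine similarity = %f\n", dot / (sqrt(qlen) * sqrt(dlen)));
    return 0;
}

(Compile with -lm for the math library.)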
In what I'm trying to do there are about 8,000 documents, each having from 3 to 50 words. So how am I supposed to compare how relevant a query is to each document? If I have
a query: ping pong
document 1: this is ping kong
document 2: i am ping tongue
To compare query-document1 I will use the words: this is ping kong pong (so 5 coordinates), and to compare query-document2 I will use the words: i am ping tongue pong (5 coordinates), and then, since I use the same comparison method, the one with the highest score is the most relevant? OR do I have to use the same set of words for both: this is ping kong pong i am tongue (8 coordinates)? So my question is: what is the right way to compare the query with all these 8,000 documents? I hope I managed to make my question easy to understand. Thank you for your time!
To compare the query-document1 i will use the words: this is ping kong pong (so 5 coordinates) and to compare the query-document2 i will use the words: i am ping tongue is kong (6 coordinates) and then since i use the same comparing method the one with the highest score is the most relevant? OR do i have to use for both the words: this is ping kong am tongue kong (7 coordinates)? So my question is which is the right way to compare all these 8000 documents with the question? I hope i succeed on making my question easy to understand. thank you for your time!