How to get tf-idf score and bm25f score of a term in a document using whoosh? - tf-idf

I am using Whoosh to index a dataset. Given a term and a document, I want to retrieve the tf-idf score and the BM25F score. I have seen scoring.TFIDF() and scoring.TFIDFScorer(). To call the TFIDFScorer().score() method, we have to pass a matcher object. Which matcher object should I pass to it?
Similarly, what parameters should I pass to BM25FScorer()._score(self, weight, length)? What are the weight and length parameters? What values are passed by default?

I was finally able to figure it out. Here it is for anyone who comes here later.
To find the TF-IDF and BM25F scores of a term in a document:
from whoosh import scoring
from whoosh.qparser import QueryParser

qp = QueryParser('content', ix.schema)
q = qp.parse(u'id:1')  # query that matches only the target document
with ix.searcher(weighting=scoring.TF_IDF()) as searcher_tfidf:
    tfidf_score = scoring.TF_IDF().scorer(searcher_tfidf, 'body', 'algebra').score(q.matcher(searcher_tfidf))
with ix.searcher(weighting=scoring.BM25F()) as searcher_bm25f:
    bm25f_score = scoring.BM25F().scorer(searcher_bm25f, 'body', 'algebra').score(q.matcher(searcher_bm25f))
ix is the Index object obtained using the open_dir() or create_in() function. The key is to get a Matcher object that matches exactly the required document. So, query on an id or any other unique field in the schema, using qp.parse(), to select that particular document.
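For intuition about the weight and length arguments asked about in the question, here is a minimal pure-Python sketch of the two formulas. It assumes the commonly documented BM25 form with the defaults B=0.75 and K1=1.2 and a smoothed idf; the function names are hypothetical and Whoosh's internals may differ in detail. Here weight stands for the term's frequency in the document's field, length for the length of that field in the document, and avg_length for the average field length across the index.

```python
import math


def smoothed_idf(doc_count, doc_freq):
    """Inverse document frequency; the exact smoothing used here is an assumption."""
    return math.log(doc_count / (doc_freq + 1)) + 1


def tf_idf_score(weight, idf):
    """TF-IDF: term weight in the document times the term's idf."""
    return weight * idf


def bm25_score(idf, weight, length, avg_length, B=0.75, K1=1.2):
    """BM25 for one field: B controls length normalization, K1 controls saturation."""
    norm = (1 - B) + B * length / avg_length
    return idf * (weight * (K1 + 1)) / (weight + K1 * norm)


# Made-up statistics: the term occurs twice in a field of average length.
example_idf = smoothed_idf(doc_count=10, doc_freq=9)
example_tfidf = tf_idf_score(2, example_idf)
example_bm25 = bm25_score(example_idf, weight=2, length=10, avg_length=10)
```

Longer-than-average fields make norm larger, so the same term frequency contributes less to the BM25 score.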

Related

Testing to see if an inputted value is present in an array in MongoDB

I have a small test collection that has this format:
Sample Data
I have created a simple query that returns an array with all the ids:
All_Ids = db.Test_Collection.find({}, {_id:1}).map(function(item){ return item._id; })
[ 98800754, 15301328, 76812898 ]
I want to take an inputted id and check whether that value is present in the array.
Here was my initial attempt:
> query_Figure = 98800754
98800754
> db.Test_Collection.find({query_Figure: {$in: All_Ids}})
I tried using the $in operator to find that specific value, with the array and the id to search for as variables, but had no luck: the query returned nothing, even though the value I'm searching for is clearly in the array.
As you can tell, I am a newbie and would appreciate some help improving the query!
The $in operator returns the documents in which the value of the specified field is present in the given array.
In your case, since query_Figure is not a field in any of your documents, your query returns no matching documents.
If you just want to check whether the input id is present in the All_Ids array, you don't need a query. Just use Array.prototype.includes():
const exists = All_Ids.includes(query_Figure);
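The same membership check can be sketched in Python. This is only an illustration: the documents list is a hypothetical in-memory stand-in for Test_Collection, and the helper name is made up.

```python
# Hypothetical in-memory stand-in for the documents in Test_Collection.
documents = [{"_id": 98800754}, {"_id": 15301328}, {"_id": 76812898}]

# Equivalent of the find().map() call above: collect every _id into a list.
all_ids = [doc["_id"] for doc in documents]


def id_exists(candidate, ids):
    """Return True when candidate is one of the collected ids."""
    return candidate in ids


query_figure = 98800754
exists = id_exists(query_figure, all_ids)
```

No field named query_figure is involved here, which is why the plain membership test works where the $in query did not.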

MongoDB numeric index

I was wondering if it's possible to create a numeric count index where the first document would be 1 and the count would increase as new documents are inserted. If so, can it also be applied to documents imported via mongoimport? I have created an index via db.collection.createIndex( {index : 1} ), but it doesn't seem to be applied.
I would strongly recommend using ObjectId as your _id field. It is a good choice for distributed systems, it encodes its creation time (so sorting by _id sorts by creation order), and MongoDB indexes _id automatically.
Example using Morphia:
Date d = ...;
Query<MyClass> query = datastore.createQuery(MyClass.class);
query.field("_id").greaterThanOrEq(new ObjectId(d));
query.sort("_id");
query.limit(100);
List<MyClass> myDocs = query.asList();
This would fetch all documents created since date d in order of creation.
To load the next batch, change to:
query.field("_id").greaterThan(lastDoc.getId());
This will very efficiently load the next batch based on the ID of the last document from the previous batch.
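The date-based range query works because an ObjectId begins with a 4-byte big-endian creation timestamp in seconds. A minimal Python sketch of building the lower-bound id for a date, assuming the remaining 8 bytes are zero-filled; the helper name is hypothetical and no driver is used.

```python
from datetime import datetime, timezone


def object_id_lower_bound(moment):
    """Return the 24-character hex string of the smallest ObjectId for a timezone-aware datetime."""
    seconds = int(moment.timestamp())
    # 4 timestamp bytes (8 hex digits) followed by 8 zero bytes (16 hex digits).
    return format(seconds, "08x") + "0" * 16


since = datetime(2020, 1, 1, tzinfo=timezone.utc)
lower_bound = object_id_lower_bound(since)
```

Any ObjectId generated at or after that moment compares greater than or equal to this value, which is what makes the _id range filter a creation-time filter.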

Solr score boost - based on number of likes

I have added fs_votingapi_result to the Solr document; it represents the number of likes.
I found the function below, which improves the score based on fs_votingapi_result.
But I don't understand the logic behind it. What are the extra parameters $vote_steepness, $total, $total, $vote_boost?
bf=recip(rord(fs_votingapi_result),$vote_steepness,$total,$total)^$vote_boost
I am new to Solr and have not been able to find any documentation or article that explains this.
This is in the Function Query documentation.
recip
A reciprocal function with recip(x,m,a,b) implementing a/(m*x+b). m,a,b are constants, x is any numeric field or arbitrarily complex function.
rord
The reversed ordinal of the indexed value. (In your case, the function rord(fs_votingapi_result) would yield 1 for the record with the most votes, 2 for the second most votes, etc.)
So
recip(rord(fs_votingapi_result),$vote_steepness,$total,$total)
= $total / ($vote_steepness * rev-ordinal-of-vote-result + $total)
Then the result is boosted by $vote_boost to create the boost function (from bf param).
= ($total / ($vote_steepness * rev-ordinal-of-vote-result + $total)) * $vote_boost
Which is added to the document score from the rest of the query. (Then before scores are returned, they are normalized across all matching docs)
The $<var> values are either defined in solrconfig.xml or more commonly passed as separate http query parameters.
Hope that gives you a starting point.
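To make the arithmetic concrete, here is a small Python sketch of the same computation. It is an illustration only, not Solr code: the like counts and parameter values are made up, and rord is approximated by ranking the distinct values in descending order.

```python
def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b), as defined above."""
    return a / (m * x + b)


def rord(values):
    """Map each distinct value to its reversed ordinal: 1 for the largest."""
    ordered = sorted(set(values), reverse=True)
    return {value: rank for rank, value in enumerate(ordered, start=1)}


def vote_boost_score(votes, all_votes, steepness, total, boost):
    """recip(rord(votes), steepness, total, total) multiplied by the boost."""
    rank = rord(all_votes)[votes]
    return recip(rank, steepness, total, total) * boost


all_votes = [12, 3, 40]  # made-up like counts for three documents
scores = {
    votes: vote_boost_score(votes, all_votes, steepness=10, total=100, boost=2.0)
    for votes in all_votes
}
```

A larger steepness makes the boost fall off faster as the rank grows, while total sets how flat the curve is near the top-ranked documents.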

Mongodb: how to auto-increment a subdocument field?

The following document records a conversation between Milhouse and Bart. I would like to insert a new message with the right num (the next in the example would be 3) in a single operation. Is that possible?
{ user_a:"Bart",
  user_b:"Milhouse",
  conversation:{
    last_msg:2,
    messages:[
      { from:"Bart",
        msg:"Hello",
        num:1
      },
      { from:"Milhouse",
        msg:"Wanna go out ?",
        num:2
      }
    ]
  }
}
In MongoDB, arrays keep their order, so by adding a num attribute, you're only creating more data for something that you could accomplish without the additional field. Just use the position in the array to accomplish the same thing. Grabbing the Xth message in an array by position will be faster than searching for { num: X }.
If you do keep num, I don't think there's an easy way to set it other than doing a find() on conversation.last_msg before you insert the new subdocument, and then incrementing last_msg.
Depending on what you need to keep the ordering for, you might consider including a time stamp in your subdocument, which is commonly kept in conversation records anyway and may provide other useful information.
Also, I haven't used it, but there's a Mongoose plugin that may or may not be able to do what you want: https://npmjs.org/package/mongoose-auto-increment
You can't create an auto-increment field, but you can use functions to generate and manage a sequence:
http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
I would recommend using a timestamp rather than a numerical value. By using a timestamp, you can keep the ordering of the subdocument and make other use of it.
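The two-step approach described in the answers (increment the counter, then append the message with the new number) can be sketched in Python as follows. This is an assumption-laden illustration: the helper names are hypothetical, the update documents use the standard $inc and $push operators, and the driver calls that would send them are not shown. The append_message function only simulates the effect on a plain dict.

```python
def reserve_num_update():
    """Update document that increments the conversation counter."""
    return {"$inc": {"conversation.last_msg": 1}}


def push_message_update(sender, text, num):
    """Update document that appends a message carrying the reserved number."""
    message = {"from": sender, "msg": text, "num": num}
    return {"$push": {"conversation.messages": message}}


def append_message(doc, sender, text):
    """In-memory simulation of both steps on a plain dict."""
    conversation = doc["conversation"]
    conversation["last_msg"] += 1
    num = conversation["last_msg"]
    conversation["messages"].append({"from": sender, "msg": text, "num": num})
    return num


doc = {
    "user_a": "Bart",
    "user_b": "Milhouse",
    "conversation": {
        "last_msg": 2,
        "messages": [
            {"from": "Bart", "msg": "Hello", "num": 1},
            {"from": "Milhouse", "msg": "Wanna go out ?", "num": 2},
        ],
    },
}
new_num = append_message(doc, "Bart", "Sure")
```

Because these are two separate updates, another writer could reserve a number in between; the numbers stay unique, but the array order and the num order may briefly disagree.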

Using SOLR how do I return the best result for a set of integer preferences

I'm trying to create a query that returns the best product depending on a few required attributes and a few optional ones that just affect the weighting.
Properties 1-3 required
Properties 4-5 optional
ratings 1-3 optional
The data is structured in the solr db like so:
property1 (string)
property2 (string)
property3 (string)
property4 (string)
property5 (string)
rating1 (int)
rating2 (int)
rating3 (int)
The query I've created so far gets me close, but it does not take into account how close the optional fields are to the requested value.
For example, the ratings are valued 1-5 for arbitrary properties such as efficiency or usefulness. If the user wants rating1 set to 4, then values 3 and 5 are still valid, just equally less so. Likewise, a value of 2 should be weighted more than 1. So it basically creates a scale based on how far the product is from the desired rating value.
defType = dismax
sort = score desc
fl = entity_id,score,property4,property5,rating1,rating2,rating3
fq = property1:215 property2:45 property3:17
bq = property4:(H)^5 OR property5:(87)^5 OR rating1:(1)^5 OR rating2:(3)^5 OR rating3:(5)^5
Since you have the rules for doing the math on the rating, I would go with a function query. You could do any math that you think works best in this case and the result could affect the boost score.
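One possible shape for such a function, sketched in Python: score each rating by the reciprocal of its distance from the desired value, so that 3 and 5 score equally when 4 is requested and 2 scores higher than 1. The formula and helper names are assumptions, not something the answer prescribes; the generated string combines Solr's recip, abs and sub functions and would still need to be tuned and tested against a real index.

```python
def closeness(rating, desired):
    """Return 1.0 for an exact match, decreasing as the distance grows."""
    return 1.0 / (1.0 + abs(rating - desired))


def closeness_function_query(field, desired):
    """Build the matching function query string: 1 / (|field - desired| + 1)."""
    return "recip(abs(sub({0},{1})),1,1,1)".format(field, desired)


desired_rating = 4
by_rating = {rating: closeness(rating, desired_rating) for rating in range(1, 6)}
bf_clause = closeness_function_query("rating1", desired_rating)
```

The resulting clause could be passed as a bf parameter, optionally with a boost factor per field to weight one rating more than another.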
