Modeling Ranking (scores) in Non Relational DataBase (NoSQL) - google-app-engine

I'm using Google App Engine, so I'm using a non-relational database (NoSQL). My question is:
What is the best way to model a ranking of players based on their scores?
For example, my players are:
Player { String name, int score}
I want to know the rank (position) of a given player and also get the top 10 players, but I'm not sure which approach is best.
Thanks.

If your scores are indexed, it's easy to do a datastore query and get players in sorted order.
So if you want the top 10 players, that's pretty trivial.
Getting the ranking for an arbitrary player is really hard. Hard enough that I'd say, avoid it if you can, and if you can't, find a hack way around it.
For example, if you have 50,000 players and PlayerX is ranked 12,345, the only way to know that is to query the players in score order and step through them, keeping count, until you reach PlayerX.
One hack might be to store the player ranking in the player entity, and update it with a cron job that runs once every few hours.
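As a rough illustration of that periodic job, here is a minimal Python sketch. It is plain in-memory code, not the App Engine datastore API, and the `rank` field and `refresh_ranks` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    score: int
    rank: int = 0  # hypothetical stored field, refreshed by the periodic job


def refresh_ranks(players):
    """Sort by score (highest first) and store a 1-based rank on each player."""
    ordered = sorted(players, key=lambda p: p.score, reverse=True)
    for position, player in enumerate(ordered, start=1):
        player.rank = position
    return ordered


players = [Player("John", 15), Player("Swadq", 7), Player("Jane", 22)]
ordered = refresh_ranks(players)
```

Between runs the stored rank is only approximately right, which is the trade-off this hack accepts.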

There is a built-in solution in Redis:
First add a few members with a score:
redis> ZADD myzset 1 "one"
(integer) 1
redis> ZADD myzset 2 "two"
(integer) 1
redis> ZADD myzset 3 "three"
(integer) 1
Get the rank of "one":
redis> ZREVRANK myzset "one"
(integer) 2
(The rank is zero-based and counted from the highest score, so the top member has rank 0.)
And if you want the current order:
redis> ZREVRANGE myzset 0 -1
1) "three"
2) "two"
3) "one"
See ZREVRANGE and ZREVRANK in redis documentation.

A suitable representation of this in JSON would be:
{
  "players" : [
    {
      "name" : "John",
      "score" : 15
    },
    {
      "name" : "Swadq",
      "score" : 7
    },
    {
      "name" : "Jane",
      "score" : 22
    }
  ]
}
For examples of how to sort this:
PHP: How to sort an array of associative arrays by value of a given key in PHP?
JavaScript: How to sort an array of associative arrays by value of a given key in PHP?
JavaScript general sorting: http://www.breakingpar.com/bkp/home.nsf/0/87256B280015193F87256C8D00514FA4
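For comparison, a minimal Python sketch of the same idea: parse the JSON, sort by score in descending order, and take the first ten entries. The variable names are only illustrative.

```python
import json

raw = """
{"players": [
  {"name": "John", "score": 15},
  {"name": "Swadq", "score": 7},
  {"name": "Jane", "score": 22}
]}
"""

players = json.loads(raw)["players"]

# Highest score first; slicing is safe even with fewer than 10 players.
ranked = sorted(players, key=lambda p: p["score"], reverse=True)
top_10 = ranked[:10]
names = [p["name"] for p in top_10]
```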

You could set up your index.yaml like so:
indexes:
- kind: Player
  properties:
  - name: score
    direction: asc
To get a player's rank you just need to make a pass over the players in score order (while keeping a count) and cache the result to speed up further lookups for that player.
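A minimal Python sketch of that pass-and-cache idea, assuming the players have already been fetched as (name, score) pairs ordered from highest to lowest score. `make_rank_lookup` is a hypothetical helper, not part of any App Engine API.

```python
def make_rank_lookup(players_by_score_desc):
    """Return a function that finds a player's 1-based rank by counting."""
    cache = {}

    def rank_of(name):
        if name in cache:
            return cache[name]
        for position, (player_name, _score) in enumerate(players_by_score_desc, start=1):
            cache[player_name] = position  # remember every player we pass
            if player_name == name:
                return position
        return None  # player not found

    return rank_of


rank_of = make_rank_lookup([("Jane", 22), ("John", 15), ("Swadq", 7)])
```

The cached positions go stale as soon as any score changes, so the cache would need to be cleared or given an expiry.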

Related

Compare 2 tables, find the same text and have it substract the numbers next to it

Let's say there are 2 tables, the first one being randomly generated. The first table has Alex with 10 Apples, Bogdan with 2 pears and Cristi with 5 oranges. In the second table I write that all over, the only difference being that if someone that shouldn't have oranges (only Alex and Bogdan can have them in this case) does, then the numbers of oranges will subtract from the total amount of fruits.
The "oranges" in my google sheet are under the "CT" name and the subtraction will be made in "Project Automation Checker" column J, the only people that can have "CT" are the range named "MDI".
https://docs.google.com/spreadsheets/d/1n8DF771658l-7lIMu2Jx7YF9ZoHGb3H8UA0eOVd8iaE/edit?usp=sharing
It's really hard to tell exactly what you are after, since no examples of the desired output were provided...
The "oranges" in my google sheet are under the "CT" name
=FILTER(A1:C10, C1:C10="CT", NOT(COUNTIF(MDI, A1:A10)))
only if the guy is found in the other table, has a "CT" and is not in the range "MDI"
=FILTER(A1:C10, C1:C10="CT", NOT(COUNTIF(MDI, A1:A10)), COUNTIF(G1:G10, A1:A10))

Is it possible to change the scoring profile based on the number of tags?

I have a document with a collection of strings representing the number of times that document appears in a region (tags). For example:
[{
  "id": "A",
  // other properties
  "regions": ["3", "3", "3", "2"] // Appears 3 times in region "3" and once in region "2"
},
{
  "id": "B",
  // other properties
  "regions": ["3", "3", "1"] // Appears twice in region "3" and once in region "1"
}]
I tried using a custom scoring profile of type Tag, but I don't see how to give documents with more regions a better score. In other words, I want document A that appears 3 times in region 3 to show before document B that only appears twice in region 3.
FYI, the reason we chose to represent regions this way is because there are way too many regions and not all documents appear in all regions. More details here
Is this doable? This way or another way?
The tag scoring profile checks for an existence of a tag. If the tag appears multiple times, it has no effect on the score.
I've read your other post here. One solution you could consider (which is not exactly what you want) is to bucket the regions based on count. For example, you'd have one collection for regions where the document shows up fewer than 10 times, one for between 10 and 50 times, and one for between 50 and 100 times (pick the ranges in a way that makes sense for the distribution of region occurrences in your scenario). Your documents would look like this:
{
  "id": "A",
  "regions10": ["3", "2"], // Appears in region 3 and 2 less than 10 times
  "regions50": ["1"] // Appears in region 1 between 10 and 50 times
}
Then, you could use a Weights scoring profile to boost documents that matched in the higher count regions:
"scoringProfiles": [
  {
    "name": "boostRegions",
    "text": {
      "weights": {
        "regions10": 1,
        "regions50": 2,
        "regions100": 3
      }
    }
  }
]
This is not a good solution if you need strict ordering based on the region count, you can't precompute the region counts, or the entire range of value is large (say 0 to 2^31) while the individual buckets need to be small (you'd end up with too many fields).
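The bucketing step could be precomputed before indexing, roughly as in this Python sketch. The bucket limits mirror the ranges mentioned above, but the `BUCKETS` table and `bucket_regions` function are assumptions made for illustration, and counts of 100 or more are simply dropped here.

```python
from collections import Counter

# Hypothetical bucket table: field name -> exclusive upper bound on the count.
BUCKETS = [("regions10", 10), ("regions50", 50), ("regions100", 100)]


def bucket_regions(regions):
    """Group each region into the first bucket whose bound exceeds its count."""
    fields = {name: [] for name, _ in BUCKETS}
    for region, count in Counter(regions).items():
        for name, upper in BUCKETS:
            if count < upper:
                fields[name].append(region)
                break  # a region belongs to exactly one bucket
    return fields


doc_a = bucket_regions(["3", "3", "3", "2"])
doc_c = bucket_regions(["1"] * 12 + ["3"])
```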
The problem you have is a data modeling problem. You're trying to retrieve documents based on a property of the document (whether it contains a region in a set of regions), but score/boost the document based on a property of the region, not the document. You'd have to have a document in the index for each document-region pair, and a property with the number of times the given document appeared in that region.
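A minimal Python sketch of that reshaping, producing one index entry per document-region pair. The field names `docId`, `region` and `regionCount` and the composite `id` format are hypothetical, chosen only for this example.

```python
from collections import Counter


def explode_by_region(doc):
    """Turn one document into one entry per region, carrying the count."""
    counts = Counter(doc["regions"])
    return [
        {
            "id": f'{doc["id"]}_{region}',  # hypothetical composite key
            "docId": doc["id"],
            "region": region,
            "regionCount": count,
        }
        for region, count in counts.items()
    ]


entries = explode_by_region({"id": "A", "regions": ["3", "3", "3", "2"]})
entries += explode_by_region({"id": "B", "regions": ["3", "3", "1"]})

# Filter on region "3", then order by how often each document appears there.
in_region_3 = sorted(
    (e for e in entries if e["region"] == "3"),
    key=lambda e: e["regionCount"],
    reverse=True,
)
order = [e["docId"] for e in in_region_3]
```

With the count stored as a numeric property, ordering within a region becomes an ordinary sort rather than a scoring problem.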

Having trouble sorting without grouping/field collapsing in Solr

Is it possible to do a compound sort in solr without Field Collapsing?
If I have two car models, Ford and Chevy, can I sort first on Ford where price is less than 2,000, then Ford > 2,000, then the Chevy models? I would like to do this without grouping, and without applying a price sort to the Chevy models.
For example, something like &sort=Model:"Ford" AND price:[0 TO 2000]
so that I get:
Ford 1, $1000
Ford 2, $500
Ford 2, $1500
_________
Ford 3, $3000
Ford 3, $5000
_______
Chevy 1
Chevy 2
Chevy 3
I've tinkered a bit with this, and I've come up with a solution based on the query() function, since you can use that together with sorting. I'm not sure about the performance, and depending on the number of documents in your index, that might not be important, so the only way is to try it and see if it performs. I've used name and price as my two fields in the schema, which I think would map to your Model and price fields.
The way sort works is that each clause is evaluated in order, so that the first sort description is performed first, then the next one if there's a draw, and so on.
I've removed url escaping and formatted everything a bit:
sort=query($sq1,0) desc,query($sq2,0) desc
&sq1=name:Ford* AND price:[0 TO 1500]
&sq2=name:Ford*
This implies that the first sort is performed on the query named in the sq1= URL parameter, but if there's a draw (which there will be, if there isn't a match), the query named under sq2= will be used ($sq1 and $sq2 refer to these two queries, and a simple substitution will be made by Solr before evaluating the query() function).
I haven't provided a default sort order, but you could add name asc as a default sort. The 0 as the second argument to query() is a value that the sort will use if there isn't a match from the query (otherwise it'll use the score from the query). You could feed this value into product() and multiply with the price, to sort each of the "buckets" by price as well if needed.
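To see how the two sort clauses interact, here is a plain-Python stand-in. It is not Solr: the `query` function below only imitates the "score on a match, default otherwise" behaviour with a fixed score of 1.0, and the sample documents are made up. Larger values are placed first so that matching documents lead.

```python
docs = [
    {"name": "Chevy 1", "price": 900},
    {"name": "Ford 3", "price": 3000},
    {"name": "Ford 1", "price": 1000},
]


def sq1(doc):
    # Imitates: name:Ford* AND price:[0 TO 1500]
    return doc["name"].startswith("Ford") and 0 <= doc["price"] <= 1500


def sq2(doc):
    # Imitates: name:Ford*
    return doc["name"].startswith("Ford")


def query(matches, doc, default=0.0):
    # Stand-in for query(): a positive value on a match, else the default.
    return 1.0 if matches(doc) else default


# The tuple key mirrors the clause order: sq1 decides first, sq2 breaks ties.
ordered = sorted(docs, key=lambda d: (query(sq1, d), query(sq2, d)), reverse=True)
names = [d["name"] for d in ordered]
```

In real Solr the matching score is not a constant, so documents inside one bucket may be reordered by score unless a further sort clause is added.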

Vector Space Model query - set of documents search

I'm trying to write code for a VSM search in C. Using a collection of documents, I built a hash table (inverted index) in which each slot holds a word along with its df and a pointer to a list; each entry in that list holds the name of a document in which the word appears at least once, along with the tf (how many times the word appears in that document). The user writes a query (and also chooses the qqq.ddd weighting and the comparison method, but that doesn't matter for my question), and I have to print the documents that are relevant to it, from most to least relevant. The examples I've seen show the steps for a single document only. For example: we have a collection of 1,000,000 documents (N = 1,000,000) and we want to compare
1 document: car insurance auto insurance
with the query: best car insurance
So in the example it creates an array like this:
Term      | Query tf | Document tf
auto      | 0        | 1
best      | 1        | 0
car       | 1        | 1
insurance | 1        | 2
The example also gives the df for each term, so with that information and the weighting and comparison methods it's easy to compare them by turning them into vectors with 4 coordinates (one for each word in the table).
So in this example there are 1,000,000 documents, and to see how relevant the document is to the query we use each of the words that appear in the query or in the document once (4 words). So we have to find 4 coordinates and then compare.
In what I'm trying to do, there are about 8,000 documents, each containing 3 to 50 words. So how am I supposed to compute how relevant a query is to each document? If I have
a query: ping pong
document 1: this is ping kong
document 2: i am ping tongue
To compare the query with document 1 I would use the words: this is ping kong pong (5 coordinates), and to compare the query with document 2 I would use the words: i am ping tongue pong (5 coordinates); then, since I use the same comparison method, is the one with the highest score the most relevant? Or do I have to use, for both, the words: this is ping kong pong i am tongue (8 coordinates)? So my question is: which is the right way to compare all 8,000 documents with the query? I hope I've made my question easy to understand. Thank you for your time!
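One way to look at this, sketched in Python rather than C and using raw tf as the weight (an assumption; the chosen qqq.ddd scheme would replace it): if each vector is stored sparsely, any word missing from it is a zero coordinate. Zeros add nothing to the dot product or to the vector lengths, so cosine similarity comes out the same whether the coordinates are taken per query-document pair or over the whole collection.

```python
import math
from collections import Counter


def cosine(query_tf, doc_tf):
    """Cosine similarity of two sparse term -> weight mappings."""
    # Only terms present in the query can contribute to the dot product.
    dot = sum(weight * doc_tf.get(term, 0) for term, weight in query_tf.items())
    query_norm = math.sqrt(sum(w * w for w in query_tf.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc_tf.values()))
    if query_norm == 0 or doc_norm == 0:
        return 0.0
    return dot / (query_norm * doc_norm)


docs = {
    "document 1": "this is ping kong",
    "document 2": "i am ping tongue",
}
query_tf = Counter("ping pong".split())

scores = {name: cosine(query_tf, Counter(text.split())) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With an inverted index, the same dot products can be accumulated by walking only the posting lists of the query terms, so documents that share no term with the query are never touched.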

The FREETEXTTABLE on MS SQL 2012 returns strange ranks

I'm trying to find several words in one table, but across different fields.
Why do records that match only one word get a higher rank than records that match two?
The example:
Record 1
Title: Eddie Murphy
Description: An American stand-up comedian, actor, writer, singer, director, and musician.
Record 2
Title: Tom Cruise
Description: An American film actor and producer. He has won three Golden Globe Awards.
SELECT * FROM FREETEXTTABLE(SubjectContent, (Title, Description), 'tom actor')
returns Record 1 with rank 61 and Record 2 with rank 47, even though Record 2 contains both words ('tom' and 'actor') and Record 1 contains only one ('actor'). So the user gets a large number of irrelevant records before the relevant one.
However, if I set the search parameter to 'tom cruise actor', the query returns a high rank.
My fulltext index:
CREATE FULLTEXT INDEX ON SubjectContent(Title, [Description])
KEY INDEX PK_SubjectContent
ON FullTextSearch;
I tried changing the 'accent sensitive' property and other properties of the full-text catalog, without success. Thanks for any help.
Looking at the two strings, the second one is the larger document from a full-text point of view, because of the sentence separator it contains. If you pass these strings to dm_fts_parser, you will see that the max occurrence for the first string is 11 and for the second it is 21. Full-text search normalizes document length into buckets of 16, 32, 128, 256, etc., so your first document falls into the first bucket and the second into the second bucket. Hence the first one gets the higher rank (rank is inversely proportional to the length of the document). The reference for all of this is here: http://msdn.microsoft.com/en-us/library/cc879245.aspx
Thanks
Venkat
