Vector Space Model query - set of documents search - C

I'm trying to write code for a VSM search in C. From a collection of documents I built a hash table (an inverted index) in which each slot holds a word along with its df and a pointer to a list. Each node of that list holds the name of a document in which the word appears at least once, along with the tf (how many times the word appears in that document). The user writes a query (and also chooses a qqq.ddd weighting scheme and a comparison method, but that doesn't matter for my question), and I have to print the documents that are relevant to it, from the most relevant to the least relevant. The examples I've seen show the steps for only one document. For example: we have a collection of 1,000,000 documents (N = 1,000,000) and we want to compare
1 document: car insurance auto insurance
with the query: best car insurance
So the example creates a table like this:
Term      | Query tf | Document tf
auto      | 0        | 1
best      | 1        | 0
car       | 1        | 1
insurance | 1        | 2
The example also gives the df for each term, so with that information plus the weighting and comparison methods it's easy to compare them: turn each one into a vector by computing its 4 coordinates (one for each word in the table).
So in this example there are 1,000,000 documents, and to see how relevant the document is to the query we use each of the 4 words that appear in the query or in the document exactly once. So we have to find 4 coordinates and then compare.
In what I'm trying to do there are about 8,000 documents, each containing from 3 to 50 words. So how am I supposed to compare how relevant a query is to each document? If I have
a query: ping pong
document 1: this is ping kong
document 2: i am ping tongue
To compare the query with document 1, I would use the words: this is ping kong pong (so 5 coordinates), and to compare the query with document 2, I would use the words: i am ping tongue is kong (6 coordinates). Then, since I use the same comparison method, is the one with the highest score the most relevant? Or do I have to use the same words for both: this is ping kong am tongue kong (7 coordinates)? So my question is: which is the right way to compare all these 8,000 documents with the query? I hope I succeeded in making my question easy to understand. Thank you for your time!
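One common way to look at this, sketched below under some assumptions, is that the query and every document live in the same space with one coordinate per vocabulary term. A coordinate where the query weight is 0 adds nothing to the dot product, so only the posting lists of the query terms need to be walked. The sketch uses a tiny array-based index as a hypothetical stand-in for the hash table described above, raw tf as the weight, and document vector lengths that are assumed to be precomputed at indexing time. All names (`toy_index`, `lookup`, `score_query`) are invented for illustration.

```c
#include <stddef.h>
#include <string.h>

#define NUM_DOCS 2

/* One entry of a posting list: document id and term frequency. */
struct posting { int doc; int tf; };

/* One inverted-index slot: a term and its posting list.
   The posting list length is the term's document frequency (df). */
struct index_entry {
    const char *term;
    const struct posting *postings;
    size_t df;
};

/* Toy index for doc 0 = "this is ping kong", doc 1 = "i am ping tongue". */
static const struct posting p_doc0[] = { {0, 1} };
static const struct posting p_doc1[] = { {1, 1} };
static const struct posting p_both[] = { {0, 1}, {1, 1} };

static const struct index_entry toy_index[] = {
    { "this", p_doc0, 1 }, { "is", p_doc0, 1 }, { "kong", p_doc0, 1 },
    { "i", p_doc1, 1 },    { "am", p_doc1, 1 }, { "tongue", p_doc1, 1 },
    { "ping", p_both, 2 },
};
#define INDEX_SIZE (sizeof toy_index / sizeof toy_index[0])

/* Document vector lengths, assumed precomputed at indexing time.
   Each toy document has four distinct terms with tf = 1, so sqrt(4) = 2. */
static const double doc_norm[NUM_DOCS] = { 2.0, 2.0 };

/* Linear lookup; a real implementation would use the hash table. */
static const struct index_entry *lookup(const char *term)
{
    for (size_t i = 0; i < INDEX_SIZE; i++)
        if (strcmp(toy_index[i].term, term) == 0)
            return &toy_index[i];
    return NULL;
}

/* Score every document against the query, one query term at a time.
   Each query term is given weight 1. The query norm is left out because
   it is the same for every document and does not change the ranking. */
void score_query(const char *const *query_terms, size_t n_terms,
                 double scores[NUM_DOCS])
{
    for (int d = 0; d < NUM_DOCS; d++)
        scores[d] = 0.0;

    for (size_t t = 0; t < n_terms; t++) {
        const struct index_entry *e = lookup(query_terms[t]);
        if (e == NULL)
            continue; /* a term such as "pong" occurs in no document */
        for (size_t k = 0; k < e->df; k++)
            scores[e->postings[k].doc] += (double)e->postings[k].tf;
    }

    for (int d = 0; d < NUM_DOCS; d++)
        scores[d] /= doc_norm[d];
}
```

For the query "ping pong" both toy documents get the same score (0.5), since each matches only "ping" and both have the same length. Sorting the documents by score in descending order then gives the ranking.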

Related

SQL Server - Query String Greater Or Equal To

I am attempting to optimise a query in my application that is causing problems when scaling my application.
The table contains two columns: FROM and TO which each contain values. Here is an example:
Row | From | To
1 | AA | Z
2 | B | C
3 | JA | JZ
4 | JM | JZ
The query is passed a name (JOHN) and should return a list of ranges from the table that could contain the name.
select * from Ranges where [From] <= 'JOHN' and [To] >= 'JOHN'
Using the table above this would result in rows 1, 3 and 4 being returned ('JM' sorts before 'JOHN', so row 4 matches as well).
The problem I am having is one of query consistency.
All indexes are in place but if I search for JOHN the query returns in 20 milliseconds, whereas MARK returns in 250 milliseconds.
Looking at query analyzer shows me that JOHN is actually searching for more rows than MARK but I'm struggling to understand how or why MARK takes so long.
If the time difference was 20 - 40 milliseconds, I could live with that but 250 is so large a difference that the overall performance of my application is terrible.
Does anybody have any idea how I could narrow down why I get such variance in my queries, or know a better way of storing and searching for string ranges (which could contain letters and numbers)?
Many thanks in advance.
EDIT - One thing I forgot to mention was that the original table contains approximately 15 million rows (it's actually postcodes).

Solr max results for particular field type

I am using Solr 4.10.3.
My schema has various fields, e.g. id, title, content, type. Many of my docs share the same type value.
id | title | content | type
1 | pro | My | abc
2 | ver | name | ht
3 | art | is | abc
and so on.
When I query Solr, I want 10 results in total (the default), but at most two of them should be of type:abc. The other 8 results can be of any type except abc, and can be of more than one type.
Is there any possible solution?
Make two queries: one with rows=2 and type:abc, and a second with rows=8 and -type:abc. The rest of the query can be identical. Then combine the results before you show them to users.
EDIT: After some research on what comes next in Solr features, I believe that combining the results will become possible once the streaming expressions are part of Solr (maybe in 5.2). See https://issues.apache.org/jira/browse/SOLR-7377
Starting with Solr 5.2 you can accomplish this using the Streaming Expressions (and the Streaming API under the covers). The expression would be
top(n=10,
  merge(
    top(n=2,
      search(fl="id, title, content, type", q="type:abc", sort="title asc")
    ),
    search(fl="id, title, content, type", q="-type:abc", sort="title asc"),
    on="title asc"
  )
)
What's going on here is that first you're finding at most two documents of type "abc", then you are finding all documents that are not of type "abc". Notice that both of these are sorted by "title asc" (it's important that they be sorted by the same field(s)). Then we merge these two streams on the field "title" in ascending order. What this does is, using the title field, pick the first document from one of the streams, then it picks the second document, then the third, and so on. By merging "on title" what we're doing is using the title field to decide which document comes first in the merged stream. The outer top will ensure that only 10 documents are returned. Because we limited the "type:abc" stream to have at most 2 documents the final merged stream will have at most 2 "type:abc" documents.
Obviously you can change the sort criteria to whatever is best for you but it is critical with a merge that the incoming streams are sorted in the same order. And in fact, the Streaming API will validate that for you and reject the expression if that is not the case.
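To make the merge step more concrete, here is a small C sketch of the same idea outside Solr: two inputs that are each already sorted by title ascending are merged on title, and the output is capped at a limit. This only mirrors the logic described above and is not how Solr implements it. The `struct doc` record and the `merge_top` function are hypothetical names, and the first input is assumed to have been limited to 2 documents beforehand.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result record: only the fields the merge needs. */
struct doc { const char *title; const char *type; };

/* Merge two streams that are each sorted by title ascending, writing at
   most `limit` documents to `out`. Returns the number written. */
size_t merge_top(const struct doc *a, size_t na,
                 const struct doc *b, size_t nb,
                 struct doc *out, size_t limit)
{
    size_t i = 0, j = 0, n = 0;
    while (n < limit && (i < na || j < nb)) {
        /* Take from `a` when `b` is exhausted or a's title sorts first. */
        if (j >= nb || (i < na && strcmp(a[i].title, b[j].title) <= 0))
            out[n++] = a[i++];
        else
            out[n++] = b[j++];
    }
    return n;
}
```

With the three sample docs from the question, the "abc" stream is art, pro and the other stream is ver, so the merged output is art, pro, ver. Because the "abc" input never holds more than 2 documents, the output cannot contain more than 2 of them.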

Relevance and Solr Grouping

Say I have the following collection of webpages in a Solr index:
+-----+----------+----------------+--------------+
| ID | Domain | Path | Content |
+-----+----------+----------------+--------------+
| 1 | 1.com | /hello1.html | Hello dude |
| 2 | 1.com | /hello2.html | Hello man |
| 3 | 1.com | /hello3.html | Hello fella |
| 4 | 2.com | /hello1.html | Hello sir |
...
And I want a query for hello to show results grouped by domain like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is ordering determined if I sort by score? I use a combination of TF/IDF and PageRank for my results normally, but since that calculates scores for each individual item, how does it determine how to order the groups? What if 1.com/hello3.html and 1.com/hello2.html have very low relevance but two results while 2.com/hello1.html has really high relevance and only one result? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.
It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
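The rule quoted above can be pictured with a short C sketch: each group carries the score of its best document, and the groups are ordered by that value alone, regardless of how many documents they contain. This is only an illustration of the ordering rule, not Solr code; `struct group`, `sort_groups` and any scores you feed it are made up.

```c
#include <stdlib.h>

/* Hypothetical group summary: domain, number of hits, best hit score. */
struct group { const char *domain; int num_found; double top_score; };

/* qsort comparator: group with the higher top_score comes first. */
static int by_top_score_desc(const void *pa, const void *pb)
{
    const struct group *a = pa, *b = pb;
    if (a->top_score < b->top_score) return 1;
    if (a->top_score > b->top_score) return -1;
    return 0;
}

/* Order groups by the score of their top document, ignoring num_found. */
void sort_groups(struct group *groups, size_t n)
{
    qsort(groups, n, sizeof groups[0], by_top_score_desc);
}
```

For example, with an invented top score of 0.4 for 1.com (3 documents) and 0.9 for 2.com (1 document), 2.com ends up first after `sort_groups`.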
If this isn't what you want, your best options are to provide a different sort parameter or result post-processing. For example, for one project I was involved in, (where we had a very modest number of groups,) we chose to pull the top three results for each group and in post processing we calculated our own sort order for the groups based on the combination of their scores and numFound values. This sort of strategy might have been prohibitive for cases with too many groups, and may not be a good idea if the more numerous groups run the risk of making the most relevant documents harder to find.

Cassandra/Solr data model improvement

I have the following table:
CREATE TABLE videos_tags (
id text,
tag text,
video text,
someotherfield bigint,
PRIMARY KEY (id)
) WITH gc_grace_seconds = 1296000
AND compaction={'class': 'LeveledCompactionStrategy'}
AND compression={'sstable_compression': 'LZ4Compressor'};
The table stores a list of tags and videos. A video can have one or more tags; and a tag can be attributed to more than one video. Example:
id | tag | video
------------------------------------------
1 | dancing | video1
2 | singing | video2
3 | prank | video3
4 | prank | video4
5 | funny | video3
6 | cover | video2
I want to show my users a list of related videos based on tag assignment: the more tags a certain video has in common with the user's video, the more "related" it is. My current approach consists of 2 steps:
1. Get a list of the user's video's tags
q=*:*&fq=video:video1&fl=tag
2. Identify the videos that use the same tags as the user's video and select the top 10 (result set slicing is done on the application side)
q=*:*&fq=tag:tag1 AND tag:tag2 AND tag:tag3 AND !video:video1&fl=video&stats=true&stats.field=someotherfield&stats.facet=video
Note: I used stats instead of plain facet because I also need the sum of someotherfield
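To pin down what "related" means here, the following C sketch counts how many tags a candidate video shares with the user's video, using the sample rows from the table above. It is a brute-force illustration of the definition only: it assumes each (tag, video) pair appears once, it would not scale to billions of rows, and `struct row`, `has_tag` and `shared_tags` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* One row of the videos_tags table (only the columns used here). */
struct row { const char *tag; const char *video; };

/* Return 1 if `video` has `tag` somewhere in `rows`, else 0. */
static int has_tag(const struct row *rows, size_t n,
                   const char *video, const char *tag)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(rows[i].video, video) == 0 &&
            strcmp(rows[i].tag, tag) == 0)
            return 1;
    return 0;
}

/* Count how many tags `candidate` shares with `target`. */
int shared_tags(const struct row *rows, size_t n,
                const char *target, const char *candidate)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (strcmp(rows[i].video, candidate) == 0 &&
            has_tag(rows, n, target, rows[i].tag))
            count++;
    return count;
}

/* Sample rows from the table above. */
static const struct row sample_rows[] = {
    { "dancing", "video1" }, { "singing", "video2" }, { "prank", "video3" },
    { "prank", "video4" },   { "funny", "video3" },   { "cover", "video2" },
};
```

With this data, video4 shares one tag ("prank") with video3, while video2 shares none, so video4 would rank as more related to video3.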
This approach yields an average execution time of 30 seconds. Unfortunately, the maximum acceptable query time for my app is 10 seconds
Is there a better approach to tackling this data requirement? I'm open to:
Alternative query approach (minor tweaks are preferred; but I can accept something as drastic as replacing my 2-step approach completely)
Alternative schema
Notes:
The actual schema has several other fields that I removed from this post for brevity
I do all read operations via Solr (Datastax Enterprise 4.6.0). Nothing fancy in the Solr schema
The table currently holds 1.5 billion rows, but could grow to double or triple of that within years (so the solution must take into account the table/index size)
No fulltext search - only exact string filters

(Full-Text) Search And Database Design

This is a system architecture question on designing full-text search with (relational) database. The specific software I'm using are Solr and PostgreSQL, just FYI.
Suppose we are building a forum with two users Andy and Betty --
Post ID | User | Title | Content
--------|-------|-------------------|---------------------------
1 | Andy | Dark Knight rocks | Dark Knight rocks blah
2 | Betty | I love Twilight | Twilight blah blah
3 | Andy | Twilight sucks | Twilight sucks blah
4 | Betty | Andy sucks | Twilight rocks, Andy sucks
When the posts table is indexed in Solr, we can easily return the posts sorted by their relevancy to "?q=twilight" or "?q=dark+knight".
Now we want to add a new feature to search for users instead of posts. A naive implementation would simply index the user name and return "Andy" for "?q=a" and "Betty" for "?q=b", but what if we want to make our system smarter so that it also takes the users' posts into account and returns "Betty" before "Andy" for "?q=twilight", because Betty mentions Twilight more than Andy does?
How would you design the system to efficiently handle the user-search function for hundreds of thousands of users and millions of posts?
Faceting on User would return number of results per user. If Andy wrote 15 posts that match Twilight while Betty wrote 10, the faceting will return them as such.
But it won't help if both wrote 15 posts about Twilight and Andy's were supposed to be more relevant: you will see all facet counts (15 and 15 in this case) even if you are paginating to see only, say, the top 5 results and Andy wrote 4 of them.
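As a rough picture of what the facet counts amount to, here is a C sketch that counts, per user, the posts whose content mentions a term, using the four sample posts from the question. It uses a plain case-insensitive substring test, which is an assumption made for brevity and not how Solr tokenizes or matches text. `struct post`, `contains_term` and `facet_count` are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical post record: author and body text. */
struct post { const char *user; const char *content; };

/* Case-insensitive substring test (ASCII only). */
static int contains_term(const char *text, const char *term)
{
    size_t len = strlen(term);
    if (len == 0)
        return 1;
    for (; *text != '\0'; text++) {
        size_t k = 0;
        while (k < len && text[k] != '\0' &&
               tolower((unsigned char)text[k]) ==
               tolower((unsigned char)term[k]))
            k++;
        if (k == len)
            return 1;
    }
    return 0;
}

/* Count the posts by `user` whose content mentions `term`. */
int facet_count(const struct post *posts, size_t n,
                const char *user, const char *term)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (strcmp(posts[i].user, user) == 0 &&
            contains_term(posts[i].content, term))
            count++;
    return count;
}

/* Sample posts from the table in the question. */
static const struct post sample_posts[] = {
    { "Andy",  "Dark Knight rocks blah" },
    { "Betty", "Twilight blah blah" },
    { "Andy",  "Twilight sucks blah" },
    { "Betty", "Twilight rocks, Andy sucks" },
};
```

For the term "twilight" this gives 2 for Betty and 1 for Andy. As noted above, a count like this says nothing about how relevant each individual post is.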
If above solution is not good enough, consider a background job that writes documents of
type: suggest_user_type (so you can distinguish them by a `fq`)
user: Andy (the user)
concatted_posts: "I think Twilight.." (concatenate the user's latest 50 posts)
once a week. And if you
fq=type:suggest_user_type&
q=concatted_posts:twilight&
fl=user
you get a sorted list of users based on relevance of concatted_posts with respect to twilight.
I believe term frequency is included in full text search ranking. It's part of a research area called information retrieval. There's also another value called the inverse document frequency, which filters out common terms.
There are other steps common to ranking text, you may want to have a look at the OpenNLP project if you're interested.
In terms of database design, there's too much to cover in a post and I'm not the one to write it. The general consensus seems to be that for very large systems the key is building an efficient index, then distributing it over a number of machines to scale performance. I would recommend reading up on PageRank and how Google developed its systems as a starting point.

Resources