Can I match important Keywords in a string? - vespa

Consider a user inputs this search string to a news search engine:
"Oops, Donald Trump Jr. Did It Again (Wikileaks Edition) :: Politics - Paste"
Imagine we have a database of News Titles, and a database of "Important People".
The goal here is: if a search string contains an important person, then return results containing this "substring" with a higher ranking than those results that do NOT contain it.
Using the Yahoo Vespa engine, how can I match a database full of people's names against long news title strings?
*I hope that made sense, sorry everyone, my english not so good :( Thank you !

During document processing/indexing of news titles you could extract named entities from the input text using the "important people" database. This process could be implemented in a custom document processor. See http://docs.vespa.ai/documentation/document-processing-overview.html.
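As a rough illustration of the extraction step (in Python rather than as an actual Vespa document processor, which would be written in Java), the sketch below matches a small in-memory list of names against a title. The IMPORTANT_PEOPLE list and both function names are hypothetical stand-ins for the real "important people" database and component.

```python
import re

# Hypothetical stand-in for the "important people" database.
IMPORTANT_PEOPLE = ["Donald Trump Jr.", "Donald Trump", "Julian Assange"]


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    # Assumption: ASCII-only names; accented characters would need extra care.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def extract_entities(title, people=IMPORTANT_PEOPLE):
    """Return the known names that occur in the title as whole-word phrases."""
    padded = " " + normalize(title) + " "
    return [name for name in people if " " + normalize(name) + " " in padded]


title = "Oops, Donald Trump Jr. Did It Again (Wikileaks Edition) :: Politics - Paste"
print(extract_entities(title))  # ['Donald Trump Jr.', 'Donald Trump']
```

Note that the shorter name "Donald Trump" also matches here; whether to keep only the longest match is a design choice that depends on how the entities should influence ranking.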
A document definition for the news search could look something like this with a custom ranking function. The document processor reads the input title and populates the entities array.
search news {
    document news {
        field title type string {
            indexing: summary | index
        }
        field entities type array<string> {
            indexing: summary | index
            match: word
        }
    }
    rank-profile entity-ranking {
        first-phase {
            expression: nativeRank(title) + matches(entities)
        }
    }
}
At query time you'll need to do the same named entity extraction on the query input and build a Vespa query tree which can search the title (e.g. using OR or WeakAnd) and also search the entities field for the possible named entities using the Vespa rank operator. For example, given your query, the actual query could look something like:
select * from sources * where rank(title contains "oops" or title
contains "donald" or title contains "trump", entities contains "Donald Trump Jr.");
You can build the query tree in a custom searcher http://docs.vespa.ai/documentation/searcher-development.html using a shared named entity extraction component.
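To make the shape of that query concrete, here is a minimal Python sketch that assembles the same kind of YQL string from a user query and a list of already extracted entities. The helper names are hypothetical and the whitespace-style tokenization is an assumption; in a real searcher the query tree would be built in Java rather than by string concatenation.

```python
import re


def quote(term):
    """Quote a term for YQL, escaping backslashes and double quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_yql(user_query, entities):
    """Build a YQL string: title terms decide recall, entities only add rank."""
    # Naive tokenization (assumption); Vespa's own linguistics may differ.
    terms = re.findall(r"\w+", user_query.lower())
    if not terms:
        raise ValueError("query has no searchable terms")
    recall = " or ".join(f"title contains {quote(t)}" for t in terms)
    if not entities:
        return f"select * from sources * where {recall};"
    boosts = ", ".join(f"entities contains {quote(e)}" for e in entities)
    return f"select * from sources * where rank({recall}, {boosts});"


print(build_yql("Oops Donald Trump", ["Donald Trump Jr."]))
```

Only the first argument of rank() determines which documents are retrieved; the remaining arguments contribute to ranking features such as matches(entities) without restricting recall.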
Some resources
Shared components & writing custom searchers/document processors (to implement the named entity extraction) http://docs.vespa.ai/documentation/jdisc/container-components.html
Ranking http://docs.vespa.ai/documentation/ranking.html
Query language http://docs.vespa.ai/documentation/query-language.html

Related

Neo4j - How to use createRelationshipIndex / createNodeIndex in full-text search

So I understand that Neo4j 3.5 and above implements full-text search in cypher query via createNodeIndex(), e.g.:
CALL db.index.fulltext.createNodeIndex("myIndex", ["PersonNode"], ["name"])
where myIndex is an arbitrary variable I make up to store the index, PersonNode is the name of my Node label, and name is one of the attributes of PersonNode where I want the full-text search performed.
And to actually perform the search by name, I can do something like the following:
CALL db.index.fulltext.queryNodes("myIndex", "Charlie")
But now assume that PersonNode has a relationship of type PURCHASED_ITEM, which is connected to another node label ProductNode as follows:
PersonNode-[:PURCHASED_ITEM]->ProductNode
And assume further that ProductNode has an attribute called productTitle indicating the display title name for each product.
My question is, I would like to set up an index for this relationship (using, presumably, createRelationshipIndex()), and perform a full-text search by productTitle and return a list of all PersonNode that purchased the given product. How can I do this?
Addendum: I understand that the above could be done by first getting a list of all ProductNode instances matching the given title, then performing a normal cypher query to extract all related PersonNode instances. I also understand that for the above example, a normal cypher query would be all that I need. But the reason I'm asking this question is that I eventually need to implement a single search bar that would allow the user to input any text, including possible misspellings and all, and have it perform a search through multiple attributes and/or relationships of PersonNode, and the results need to be sorted by some kind of relevance score. And in order to do this, I feel I need to first grasp exactly how the relationship queries work in neo4j.
Here is an example of how to create a full-text index for the productTitle property of PURCHASED_ITEM relationships:
CALL db.index.fulltext.createRelationshipIndex("myRelIndex", ["PURCHASED_ITEM"], ["productTitle"])
And here is a snippet showing the use of that index:
CALL db.index.fulltext.queryRelationships("myRelIndex", "Hula Hoop") YIELD relationship, score
...
productTitle is a property of ProductNode, not of the PURCHASED_ITEM relationship
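Given that productTitle lives on ProductNode, one possible approach is a full-text node index on ProductNode followed by a traversal back to the buyers. The sketch below only holds the Cypher text in Python strings; the index name "productIndex" and the fuzzy() helper are made up for illustration, and actually running the statements would need a Neo4j 3.5+ session, which is not shown.

```python
# Cypher kept as plain strings; executing them needs a Neo4j session (not shown).
# "productIndex" is a made-up index name.
CREATE_INDEX = (
    "CALL db.index.fulltext.createNodeIndex("
    '"productIndex", ["ProductNode"], ["productTitle"])'
)

FIND_BUYERS = (
    'CALL db.index.fulltext.queryNodes("productIndex", $text) '
    "YIELD node, score "
    "MATCH (person:PersonNode)-[:PURCHASED_ITEM]->(node) "
    "RETURN person, node.productTitle AS productTitle, score "
    "ORDER BY score DESC"
)


def fuzzy(text):
    """Append Lucene's fuzzy operator (~) to each word to tolerate misspellings."""
    # Assumption: the input contains no Lucene special characters to escape.
    return " ".join(word + "~" for word in text.split())


params = {"text": fuzzy("Hula Hop")}
print(CREATE_INDEX)
print(FIND_BUYERS)
print(params)
```

Because the index returns a relevance score, the same pattern could be repeated for other labels and properties and the results merged by score for a single search bar.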

Index structure for azure search

I'm putting together a query to index medicines. A user should be able to enter their search term into a single search box. Their search term might be either a brand name for a drug, a generic name (the underlying compound on which all brands are based) or an indication and they should be returned a list of medicines that correspond to their search. I'd like to have a category facet for the type - either indication, brand or generic.
To have a category facet, my understanding is that I'd have to send my data through as one row per search term where that search term might be a brand, indication or a generic, rather than one row per brand with columns for generic list and indication. Is this correct or is there another way to get at what I'm wanting to do?
I hope I understand your ask here. From the screenshot you provided, I would assume what you would want to do is make the field "MedicineInformationType" a Facetable field in your Azure Search index and make the field "SearchTerm", "Product", "GenericList", and "ActionList" all Searchable fields in your Azure Search index (although I am not sure why you would want the "SearchTerm" field if the term in this field is already in one of the other fields).
If you structure your index this way, you can do a search for say "phosphate" and facet over the "MedicineInformationType" field to get a count of the results that are generic or brands.
For example (as a REST call):
search=phosphate&facet=MedicineInformationType
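A small Python sketch of building that query string follows. The function name is hypothetical, and the rest of the request (service URL, index name, api-version, API key) is deliberately left out.

```python
from urllib.parse import urlencode


def search_query_string(term, facet_field):
    """Return the query-string portion of a faceted search request."""
    return urlencode({"search": term, "facet": facet_field})


print(search_query_string("phosphate", "MedicineInformationType"))
# search=phosphate&facet=MedicineInformationType
```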

How processed tokens get stored in base index in Vespa?

While working with search definition which looks like
search music {
    document music {
        field title type string {
            indexing: summary | attribute | index
        }
    }
}
if I implement my own tokenization logic in a document processor (saving the processed tokens in the context of the Processing), how do I store those tokens in the base index? And how are they mapped back to the original content of the field when documents are recalled for a particular query? Is this solved with a ProcessingEndPoint? If so, how?
First, you should almost certainly drop "attribute" for this field - "attribute" means the text will be stored in a forward store in memory in addition to creating an index for searching. That may be useful for structured data for sorting, grouping and ranking, but not for a free-text field.
Unnecessary details:
You can perform your own document processing by adding document processor components: http://docs.vespa.ai/documentation/docproc-development.html. Token information for indexing is stored as annotations over the text, which are consumed by the indexer: http://docs.vespa.ai/documentation/annotations.html
The code doing this in Vespa (called by a document processor) is https://github.com/vespa-engine/vespa/blob/master/indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/linguistics/LinguisticsAnnotator.java, and the annotations it adds, which are consumed during indexing are https://github.com/vespa-engine/vespa/blob/master/document/src/main/java/com/yahoo/document/annotation/AnnotationTypes.java. You'd also need to do the same tokenization at the query side, in a Searcher: http://docs.vespa.ai/documentation/searcher-development.html
However, there is a much simpler way to do this: You can plug in your own tokenizer as described here: http://docs.vespa.ai/documentation/linguistics.html: Create your own component subclassing SimpleLinguistics and override getTokenizer to return your implementation. This will be executed by Vespa as needed both on the document processing and query side.
The reason for doing this is usually to provide linguistics for languages other than English. If you do this, please consider contributing your linguistics code back to Vespa.
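The point about running the same tokenization on the document and query side can be illustrated with a toy inverted index in Python. This is not how Vespa stores data internally; tokenize() and TinyIndex are hypothetical names, and the sketch only shows that tokens point at document ids while the original text is kept unmodified for display.

```python
from collections import defaultdict


def tokenize(text):
    """Hypothetical custom tokenizer: lowercase, split on whitespace and hyphens."""
    return text.lower().replace("-", " ").split()


class TinyIndex:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.postings = defaultdict(set)  # token -> ids of documents containing it
        self.documents = {}               # id -> original, unmodified text

    def add(self, doc_id, text):
        self.documents[doc_id] = text
        for token in self.tokenizer(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return original texts of documents containing every query token."""
        tokens = self.tokenizer(query)
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.documents[i] for i in sorted(ids)]


index = TinyIndex(tokenize)
index.add(1, "Hip-Hop Classics")
index.add(2, "Classical Piano")
print(index.search("hip hop"))  # ['Hip-Hop Classics']
```

If the query side split only on whitespace, a query such as "Hip-Hop" would produce the single token "hip-hop" and miss document 1, which is why a pluggable tokenizer shared by both sides is the simpler route.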

Solr store and search ordinal numbers with suffixes

Suppose I have to store student academic details like...
College name -- text field searchable
Student Class -- text field searchable
Subjects -- multivalue field , text field searchable
How do I store/handle "Student Class", given that a user can search like this: "students of class 4th", "Students of class 4", "student of class fourth"?
How can I handle these variations (4th, 4, fourth)? What are elegant ways to do so?
Thanks
Amit Aggarwal
One way to solve this problem is to use a field type that supports query time synonyms. Check out the "text_general" type in the example solr schema.
In practice you would add rows like this to the synonyms.txt file in your core's conf dir:
# numbers
1,1st,first
2,2nd,second
3,3rd,third
4,4th,fourth
Now, let's suppose you had a document such as:
{ "college":"Princeton", "class":"1", "subjects":["CS 101", "introduction to full text search"]}
You could then retrieve that document if you do a query such as:
class:first
In this example the search query is directed at one field, which may or may not be what you want. If you need number-synonym matching across multiple fields (i.e., a search query with no field specifier, just the search term), you could copy the content of all those fields into a single synonym-searchable field (using copyField), such as content_synonyms, and then run the query against this field by default.
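To show what query-time synonym expansion does with the synonyms.txt rows above, here is a small Python simulation. It is not Solr code: load_synonyms() and expand() are hypothetical helpers, and only comma-separated equivalence rows are handled (no "=>" mappings).

```python
SYNONYMS_TXT = """\
# numbers
1,1st,first
2,2nd,second
3,3rd,third
4,4th,fourth
"""


def load_synonyms(text):
    """Map each term to the full set of terms on its synonyms.txt row."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = {term.strip().lower() for term in line.split(",")}
        for term in group:
            table[term] = group
    return table


def expand(term, table):
    """Return the term plus its synonyms, as query-time expansion would."""
    return sorted(table.get(term.lower(), {term.lower()}))


table = load_synonyms(SYNONYMS_TXT)
print(expand("fourth", table))  # ['4', '4th', 'fourth']
print("1" in expand("first", table))  # True: class:first reaches a document with class "1"
```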

Implementing keyword search on Google App Engine?

I'm trying to implement a keyword/tags search for a certain entity type in GAE's datastore:
class Docs(db.Model):
    title = db.StringProperty()
    user = db.StringProperty()
    tags = db.StringListProperty()
I also wrote a very basic search function (using a fake list of tags, not the datastore values), that takes a query string and matches it to each set of tags. It ranks Docs based on how many query words match the tags. This is pretty much all I need it to do, except using the actual datastore.
I have no idea how to get the actual datastore values though. For my search function to work I need a list of all the entities in the datastore, which is impossible (?).
I also tried looking into GAE's experimental full-text search, and Relation Index Entities as a way to search the datastore without using the function I wrote. Neither was successful.
Any other ideas on how to search for entities based on tags?
It's a very simple query. If you need to find all Docs with the tag "findme", it's simply:
num_results = 10
# On a list property, "=" matches entities whose list contains the value
query = Docs.all().filter("tags =", "findme")
results = query.fetch(num_results) # get list of results
It's well documented:
https://developers.google.com/appengine/docs/python/datastore/queries
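The ranking step described in the question (ordering Docs by how many query words match their tags) can then be applied to whatever the datastore queries return. Below is a minimal sketch using plain dicts as stand-ins for fetched entities; rank_by_tags() and the sample data are hypothetical.

```python
def rank_by_tags(query, docs):
    """Order docs by how many distinct query words appear in their tags."""
    words = set(query.lower().split())
    scored = []
    for doc in docs:
        hits = len(words & {tag.lower() for tag in doc["tags"]})
        if hits:
            scored.append((hits, doc["title"]))
    # Highest match count first; ties broken alphabetically by title
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


# Stand-ins for entities fetched from the datastore, e.g. one query per word
docs = [
    {"title": "Intro to Python", "tags": ["python", "tutorial"]},
    {"title": "GAE Datastore Guide", "tags": ["python", "gae", "datastore"]},
    {"title": "Cooking Basics", "tags": ["food"]},
]
print(rank_by_tags("python datastore", docs))
# ['GAE Datastore Guide', 'Intro to Python']
```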
