I have strings that are values of a property in my ontology, like: "Foo1 hasBar Bar1, Foo2 hasBaz Baz1,..."
What I want to do is loop through the string, turning each comma-separated triple into an actual triple. BTW, I know the first thought may be "why didn't you just process the data that way with an upload tool like Cellfie" or "call the SPARQL query from a programming language", but for my particular client they would rather just use SPARQL, and the ontology is already a given.
I have written a query that does what I want for the first triple and then rewrites the string to remove that triple. E.g., it finds the first triple, turns it into RDF, inserts it into the graph, and then changes the original string property to: "Foo2 hasBaz Baz1,..."
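For illustration, a minimal sketch of that kind of update (the ex: prefix, the ex:pendingTriples property, and the ", " separator are all assumptions, not my actual data):

PREFIX ex: <http://example.org/>
DELETE { ?s ex:pendingTriples ?str }
INSERT {
  ?subj ?pred ?obj .
  ?s ex:pendingTriples ?rest .   # becomes "" once the last entry is processed
}
WHERE {
  ?s ex:pendingTriples ?str .
  FILTER(?str != "")
  BIND(IF(CONTAINS(?str, ","), STRBEFORE(?str, ","), ?str) AS ?head)
  BIND(STRAFTER(?str, ", ") AS ?rest)
  BIND(IRI(CONCAT(STR(ex:), STRBEFORE(?head, " "))) AS ?subj)
  BIND(STRAFTER(?head, " ") AS ?tail)
  BIND(IRI(CONCAT(STR(ex:), STRBEFORE(?tail, " "))) AS ?pred)
  BIND(IRI(CONCAT(STR(ex:), STRAFTER(?tail, " "))) AS ?obj)
}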
So I can just run the query repeatedly until there are no more strings to process, but that's kind of a pain. I've looked through the SPARQL documentation and the examples regarding SPARQL and iteration on this site, and I just don't think it is possible given the declarative nature of SPARQL, but I wanted to double-check. Perhaps I could do something like embed the current query in another query?
I have documents that contain simple arrays of strings, and I can't seem to set up a filter that brings back all documents where a given array field has at least one element that is not the empty string "". This is on a collection with 6500 documents, of which 3700 should meet the above criteria (I checked by pulling all records and performing the filter client-side).
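For example, a document shaped like this (hypothetical) should be matched, since it has one non-empty element:

{ "_id": 1, "field": ["", "some text", ""] }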
I am mainly using the driver in .NET, but I've also tinkered with the filter in Compass. Using the driver I've tried Ne, Not(Eq), AnyNe, Not(AnyEq), Nin[""], and Not(In[""]). I would like to use ElemMatch, but it seems to be geared towards arrays of documents rather than arrays of strings, since you have to specify a field name, which doesn't exist in this case. I've also tried setting up a .Where filter that looped through to find any non-empty string in the array, but it threw an exception at run-time (I'm coding in VB).
Builders(Of BsonDocument).Filter.AnyNe(Of String)("field", String.Empty)
I would expect that the above filter, where "field" is a reference to an array of strings, would bring back 3700 documents but I get 0.
I imagine I'm the one missing something here, as this does not seem like it should be a difficult query/filter to construct. Any help would be greatly appreciated.
For anyone as confused as I am:
I mentioned earlier that ElemMatch seems to be geared for arrays of documents, but apparently if you forgo the Builder classes and type out the query manually, you can actually use ElemMatch within .NET to query a simple string-array field where at least one entry is not the empty string "".
Correct/Working example: {"[array_field_name]": {$elemMatch: {$ne: ""}}}
If anyone can tell me how to create that example using the Builder classes, that'd be awesome.
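For what it's worth, here is an untested sketch of how that might look with the builders, relying on the untyped ElemMatch overload and the implicit conversion from BsonDocument to a filter definition:

' Untested sketch; assumes Imports MongoDB.Bson and Imports MongoDB.Driver.
' Should render as {"field": {$elemMatch: {$ne: ""}}}.
Dim filter = Builders(Of BsonDocument).Filter.ElemMatch(Of BsonValue)("field", New BsonDocument("$ne", ""))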
While working with a search definition which looks like
search music {
    document music {
        field title type string {
            indexing: summary | attribute | index
        }
    }
}
if I use my own custom string-tokenizing logic by developing a document processor (I save the processed tokens in the context of Processing), how do I store the tokens in the base index? And how are they mapped back to the original content of the field at recall time for a particular query? Do we solve this with ProcessingEndPoint? If yes, how?
First, you should almost certainly drop "attribute" for this field: "attribute" means the text will be stored in a forward store in memory in addition to being indexed for searching. That may be useful for structured data for sorting, grouping, and ranking, but not for a free-text field.
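That is, the definition above would become:

search music {
    document music {
        field title type string {
            indexing: summary | index
        }
    }
}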
Unnecessary details:
You can perform your own document processing by adding document processor components: http://docs.vespa.ai/documentation/docproc-development.html. Token information for indexing is stored as annotations over the text, which are consumed by the indexer: http://docs.vespa.ai/documentation/annotations.html
The code doing this in Vespa (called by a document processor) is https://github.com/vespa-engine/vespa/blob/master/indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/linguistics/LinguisticsAnnotator.java, and the annotation types it adds, which are consumed during indexing, are defined in https://github.com/vespa-engine/vespa/blob/master/document/src/main/java/com/yahoo/document/annotation/AnnotationTypes.java. You'd also need to do the same tokenization on the query side, in a Searcher: http://docs.vespa.ai/documentation/searcher-development.html
However, there is a much simpler way to do this: you can plug in your own tokenizer as described at http://docs.vespa.ai/documentation/linguistics.html: create your own component subclassing SimpleLinguistics and override getTokenizer to return your implementation. This will be executed by Vespa as needed, both on the document processing side and on the query side.
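A minimal sketch of such a component (MyTokenizer is a hypothetical implementation of Vespa's Tokenizer interface; the package names are my reading of the docs above):

import com.yahoo.language.process.Tokenizer;
import com.yahoo.language.simple.SimpleLinguistics;

// Sketch: return a custom tokenizer; Vespa invokes this on both the indexing and query side.
public class MyLinguistics extends SimpleLinguistics {

    private final Tokenizer tokenizer = new MyTokenizer(); // hypothetical custom tokenizer

    @Override
    public Tokenizer getTokenizer() {
        return tokenizer;
    }
}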
The reason for doing this is usually to provide linguistics for languages other than English. If you do this, please consider contributing your linguistics code back to Vespa.
I'm writing the CSV file to train a ranker in the Watson Retrieve and Rank service, with many rows of the form [query,"id_doc","relevance_score",...].
I have two questions about the structure of this file:
1. I have to distinguish between two documents, depending on whether or not the query contains the word "not". More specifically:
the body and the title of the first document contain "manager"
the body and the title of the second document contain "not manager"
Thus, if the query is like "I'm a manager. How do I....?" then the first document is correct, but not the second one.
if the query is like "I'm not a manager..." then the second document is correct, but not the first one.
Is there any particular syntax that can be used to write the query in a proper way? Maybe using boolean operators? Is this file the right place to apply this kind of filter?
2. This service also has a web interface to train a ranker. The rating used on this site is: 1 -> incorrect answer, 2 -> relevant to the topic but doesn't answer the question, 3 -> good, but can be improved, 4 -> perfect answer.
Is the relevance score used in this file the same as the one in the web interface?
Thank you!
Is there any particular syntax that can be used to write the query in a proper way? Maybe using boolean operators? Is this file the right place to apply this kind of filter?
As you hinted, this file is not quite the appropriate place for filters. The training data will be used to figure out what types of lexical-overlap features the ranker should pay attention to when trying to optimize the ordering of the search results from Solr (see the discussion here for more information: watson retrieve-and-rank - manual ranking).
That said, you can certainly add at least two rows to your training data like so:
The first row can have the question text "I'm a manager. How do I do something" along with the corresponding correct doc id and a positive integer relevance label.
The second row can have the question text "I'm not a manager. How do I do something" along with the answering doc id for non-managers and a positive integer relevance label.
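With hypothetical doc ids and labels, those two rows might look like:

"I'm a manager. How do I do something","manager_doc_id","3"
"I'm not a manager. How do I do something","non_manager_doc_id","3"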
With a sufficient number of such examples, hopefully the ranker will learn to pay attention to bigram lexical-overlap features. If this is not working, you can certainly play with pre-detecting manager vs. not-manager queries and applying appropriate filters, but I believe that's done with a separate parameter (fq?), so you might have to modify train.py to pass the filter query appropriately (the default train.py takes the full query and passes it via the q parameter to the /fcselect endpoint).
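As a rough sketch of that change (the endpoint URL, the audience field, and the parameter handling are assumptions, not the actual train.py code):

import requests

# Sketch: pass a filter query (fq) along with q when calling /fcselect.
cluster_url = "https://example.com/solr/my_collection/fcselect"  # hypothetical cluster URL
params = {
    "q": "I'm not a manager. How do I do something",
    "fq": "audience:non_manager",  # hypothetical pre-detected filter
    "ranker_id": "my-ranker-id",   # hypothetical ranker id
    "wt": "json",
}
response = requests.get(cluster_url, params=params, auth=("username", "password"))
print(response.json())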
Is the relevance score used in this file the same as the one in the web interface?
Not quite: the web interface uses the 1-4 star rating to improve the UI for data collection, but then compresses the star ratings to a smaller relevance-label scale when generating the training data for the ranker. I think the compression gives bad answers (i.e. star ratings < 3) a relevance label of 0 and passes the higher star ratings through as is, so that there are effectively 3 levels of rating (though maybe someone on the UI team can clarify the details if need be). It is important for the underlying ranking algorithm that bad answers receive a relevance label of 0.
I have successfully implemented a Czech lemmatizer for Lucene. I'm testing it with Solr and it works nicely at index time. But it doesn't work so well when used for queries, because the query parser doesn't provide any context (words before or after) to the lemmatizer.
For example, the phrase pila vodu is analyzed differently at index time than at query time. It contains the ambiguous word pila, which could be the noun pila (a saw, e.g. a chainsaw) or a past-tense form of the verb pít ("to drink").
pila vodu ->
Index time: pít voda
Query time: pila voda
... so the word pila is not found and not highlighted in a document snippet.
This behaviour is documented on the Solr wiki (quoted below), and I can confirm it by debugging my code (only the isolated strings "pila" and "vodu" are passed to the lemmatizer).
... The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" separately, ...
So my question is:
Is it possible to somehow change, configure, or adapt the query parser so the lemmatizer would see the whole query string, or at least some context around individual words? I would like a solution that also works for different Solr query parsers like dismax and edismax.
I know that there is no such issue with phrase queries like "pila vodu" (in quotes), but then I would lose the documents without the exact phrase (e.g. documents with "pila víno" or even "pila dobrou vodu").
Edit - trying to explain/answer the following question (thank you @femtoRgon):
If the two terms aren't a phrase, and so don't necessarily come together, then why would they be analyzed in context to one another?
It would certainly be better to analyze only terms that come together. For example, at indexing time the lemmatizer detects sentences in the input text and analyzes together only words from a single sentence. But how can I achieve something similar at query time? Is implementing my own query parser the only option? I quite like the pf2 and pf3 options of the edismax parser; would I have to implement them again in my own parser?
The underlying idea is in fact a bit deeper, because the lemmatizer does word-sense disambiguation even for words that have the same lexical base. For example, the word bow has about 7 different senses in English (see Wikipedia), and the lemmatizer distinguishes such senses. So I would like to exploit this potential to make searches more precise -- to return only documents containing the word bow in the concrete sense required by the query. So my question could be extended to: how do I get the correct <lemma;sense> pair for a query term? The lemmatizer is very often able to assign the correct sense if the word is presented in its common context, but it has no chance when there is no context.
Finally, I implemented my own query parser.
It wasn't that difficult thanks to the edismax sources as a guide and a reference implementation. I could easily compare my parser results with the results of edismax...
Solution:
First, I analyze the whole query string together. This gives me the list of "tokens" (see the sketch after this list).
There is a little clash with stop words: it is not that easy to get tokens for stop words, as they are omitted by the analyzer, but you can detect them from the PositionIncrementAttribute.
From "tokens" I construct the query in the same way as edismax do (e.g. creating all 2-token and/or 3-token phrase queries combined in DisjunctionMaxQuery instances).
Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they should carry more weight than the rest of the words.
For specific documents you can use the QueryElevationComponent, which uses a special XML file in which you place the specific terms for which you want specific doc ids returned.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath: you control the final query. Or, in the worst case, you can modify it after you receive it on the Solr server side.
Similar: Lucene term boosting with sunspot-rails
When you build the query you can define which values to search for and how much weight each field carries in the search.
This can be done in many ways:
Setting the boost
The boost can be set by using "^" followed by a weight.
Using the plus operator
If you add the + operator to a term in your query, only documents that match that field value are returned in the results.
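For example, in Lucene query syntax (the weights are made up), the first query below boosts the important terms while the second makes them mandatory:

printer^4 paper^4 morning ran out
+printer +paper morning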
For a better understanding of Solr, it is best to get familiar with the Lucene query syntax. Refer to this link for more info.