How to do Full Text combined w/ Geo Spatial in Solr 7?
In regards to this: https://lucene.apache.org/solr/guide/7_2/spatial-search.html
I have to run queries that COMBINE full text with geo-spatial. For example:
bounding box AND full text, or spatial sort AND full text.
I was not able to figure out a good query-string example that produces this desired result. I would like this as a pure query string rather than some Java method, as I'm consuming this from tech other than Java. Solr is very deep and confusing, and I know I must read more, but I could not find good examples for this anywhere online.
desired query string example
[solr]/select?fq={!bbox sfield=point}&pt=34.04506799999999,-118.260849&d=10000&sort=geodist() asc&{!geofilt}&sfield=point&q=Panini
So in that case, it would sort by distance yet also filter by some full-text value.
If this cannot be done, I believe it is possible in Elasticsearch. Both Solr and Elasticsearch are built on top of Lucene, so if it works on one it seems like it should work on the other, but feel free to supply an answer for Elasticsearch as well.
example returned
{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "Panini",
      "pt": "34.04506799999999,-118.260849",
      "d": "10000",
      "{!geofilt}": "",
      "fq": "{!bbox sfield=point}",
      "sort": "geodist() asc",
      "sfield": "point"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}
The docs do contain the phrase 'Panini', but none are returned. Maybe this is due to the default way full text is handled in Solr 7? It is using the same point where the term 'Panini' is used, and the field point is of type org.apache.solr.schema.PointType.
UPDATE
I ended up abandoning Solr for Elasticsearch. Solr is just very annoying in its strange ways compared with the very easy to use Elasticsearch, where things just work as you expect without having to dig into quirks.
I adapted my answer to the Solr 7.2.1 example:
Start Solr with: ./bin/solr start -e techproducts
I've also visualized the data in google maps:
https://www.google.com/maps/d/u/0/viewer?ll=42.00542239270033%2C-89.81213734375001&hl=en&hl=en&z=4&mid=16gaLvuWdE9TsnhcbK-BMu5DVYMzR9Vir
You need these query parameters:
Bound by Box Filter:
fq={!bbox}
The geo filter query parser bbox needs further parameters:
Solr field: sfield=store
Point to search/sort from: pt=36.35,-97.51
Distance for filter: d=1200
Sort:
sort=geodist() asc
Fulltext query:
q=some+text
Full example queries for solr example data:
Simple:
http://localhost:8983/solr/techproducts/select?fq={!bbox}&sort=geodist()%20asc&sfield=store&pt=36.35,-97.51&d=1200&q=ipod&fl=name,store
UI:
http://localhost:8983/solr/techproducts/browse?fq={!bbox}&sort=geodist()%20asc&sfield=store&pt=36.35,-97.51&d=1200&q=ipod
The result is as expected:
Apple 60 GB iPod
Belkin Power Cord for iPod
Filtered out by distance: iPod & iPod Mini USB 2.0 Cable
Hints
The field store must be of type location.
You may need to URL-encode the special characters:
e.g. fq=%7B%21bbox%20sfield%3DgeoLocation%7D
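If you build the URL programmatically, a standard URL-encoding routine handles the special characters for you. A minimal Python sketch, using the parameter values from the techproducts example above:

```python
from urllib.parse import urlencode

# Parameter values taken from the techproducts example above.
params = {
    "q": "ipod",
    "fq": "{!bbox}",
    "sfield": "store",
    "pt": "36.35,-97.51",
    "d": "1200",
    "sort": "geodist() asc",
}

# urlencode percent-encodes {, !, (, ) and friends automatically,
# so fq={!bbox} comes out as fq=%7B%21bbox%7D.
query_string = urlencode(params)
print(query_string)
```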
In your case, you have to combine the full-text search scoring with the spatial distance.
So if your query looks like this:
/select?fq={!bbox sfield=point}&pt=34.04506799999999,-118.260849&d=10000&sort=geodist() asc&{!geofilt}&sfield=point&q=Panini
You should change the sort parameter: either remove it or set it to score desc. That way you sort by the score given by the full-text search query.
To take the spatial part into consideration, you need to include a boosting function in your query. In the majority of cases, the closer the document is to the point of interest the better, so you would probably include a boosting function that does X/distance. The X can be as simple as 1, and the function itself can be more complicated. To do that in a dismax query you would use the bf parameter, like bf=div(1,geodist()).
Try that out; it should work, but of course it will need some adjustments.
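To make the additive effect of bf concrete, here is a toy Python sketch of how a bf=div(1,geodist()) term reshapes the final score. The numbers are made up purely for illustration:

```python
def boosted_score(text_score: float, distance: float) -> float:
    """Additive dismax-style bf: final = full-text score + 1/distance."""
    return text_score + 1.0 / distance

# A nearby document can outrank a slightly better text match:
near = boosted_score(2.0, 0.5)   # 2.0 + 2.0 = 4.0
far = boosted_score(2.5, 10.0)   # 2.5 + 0.1 = 2.6
```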
Related
I'd like to use the new Semantic Knowledge Graph capability in Solr to answer this question:
Given a set of documents from several different publishers, compute a "relatedness" metric between a given publisher and every other publisher, based on the text content of their respective documents.
I've watched several of Trey Grainger's talks regarding the Semantic Knowledge Graph functionality in Solr (this is a great recent one: https://www.youtube.com/watch?v=lLjICpFwbjQ). I have a reasonably good understanding of Solr faceted search functionality, and I have a working Solr engine with my dataset indexed and searchable. So far I've been unable to construct a facet query to do what I want.
Here is an example curl command which I thought might get me what I want:
curl -sS -X POST http://localhost:8983/solr/plans/query -d '
{
  params: {
    fore: "publisher_url:life.church",
    back: "*:*"
  },
  query: "*:*",
  limit: 0,
  facet: {
    pub_type: {
      type: terms,
      field: "publisher_url",
      limit: 5,
      sort: { "r1": "desc" },
      facet: {
        r1: "relatedness($fore,$back)"
      }
    }
  }
}'
Below are the resulting facets. Notice that after the first bucket (which matches the foreground query), the others all have exactly the same relatedness, which leads me to believe that the relatedness is based only on the publisher_url field rather than the entire text content of the documents.
{
  "facets": {
    "count": 2152,
    "pub_type": {
      "buckets": [
        {
          "val": "life.church",
          "count": 141,
          "r1": {
            "relatedness": 0.38905,
            "foreground_popularity": 0.06552,
            "background_popularity": 0.06552
          }
        },
        {
          "val": "10ofthose.com/us/products/1039/colossians",
          "count": 1,
          "r1": {
            "relatedness": -0.00285,
            "foreground_popularity": 0.0,
            "background_popularity": 4.6E-4
          }
        },
        {
          "val": "14DAYMARRIAGECHALLENGE.COM",
          "count": 1,
          "r1": {
            "relatedness": -0.00285,
            "foreground_popularity": 0.0,
            "background_popularity": 4.6E-4
          }
        },
        {
          "val": "23blast.com",
          "count": 1,
          "r1": {
            "relatedness": -0.00285,
            "foreground_popularity": 0.0,
            "background_popularity": 4.6E-4
          }
        },
        {
          "val": "2911worship.com",
          "count": 1,
          "r1": {
            "relatedness": -0.00285,
            "foreground_popularity": 0.0,
            "background_popularity": 4.6E-4
          }
        }
      ]
    }
  }
}
I'm not very familiar with the relatedness function, but as far as I understand, the relatedness score is generated from the similarity between your foreground and background set of documents for that facet bucket.
Since your foreground set only contains that single value (and none of the others), the first bucket is the only one that will generate a different similarity score when you're faceting on the same field you use for selecting documents.
I'm not sure your use case is a good match for what you're trying to use: relatedness indicates that single terms in a field are related between the two sets you're using, not a similarity score across a different field for the two comparison operators.
You probably want something more structured than a text field to generate relatedness() scores, as that's usually more useful for finding single values that generate statistical insight into the structure of your query set.
The More Like This functionality might actually be a better match for getting the most similar other sites instead.
Again, this is based on my understanding of the functionality at the moment, so someone else can hopefully add more details and correct me as necessary.
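For intuition only, here is a toy Python proxy for what a foreground-vs-background comparison does. This is not Solr's actual relatedness formula, just a sketch of the over-representation idea:

```python
def relatedness_proxy(fg_count: int, fg_size: int,
                      bg_count: int, bg_size: int) -> float:
    """Positive when a bucket is over-represented in the foreground set
    relative to the background set. NOT Solr's exact formula."""
    return fg_count / fg_size - bg_count / bg_size

# Bucket matching the foreground query: present in all 141 fg docs.
own_bucket = relatedness_proxy(141, 141, 141, 2152)

# Bucket absent from the foreground: slightly negative, much like the
# identical small negative scores in the response above.
other_bucket = relatedness_proxy(0, 141, 1, 2152)
```

This also illustrates why all the non-matching buckets score identically: each appears in zero foreground documents and one background document, so the inputs to the comparison are the same.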
I'm trying to get "significant terms" for a subset of documents in Solr. This may or may not be the best way, but I'm currently attempting to use Solr's TF-IDF functionality since we have the data stored in Solr and it's lightning fast. I want to restrict the "DF" count to a subset of my documents, through a search or a filter. I tried this, where I'm searching for "apple" in the name field:
http://localhost:8983/solr/techproducts/tvrh?q=name:apple&tv.tf=true&tv.df=true&tv.tf_idf=true&indent=on&wt=json&rows=1000
and that, of course, only gives me documents that have "apple" in the name; the document frequency, however, gives counts from the entire dataset, which isn't what I want. I would think Solr can do this, but maybe not. I'm open to suggestions.
Thanks,
Adrian
This is one of the items in my backlog [1].
What you need is the document frequency in your foreground set (your subset of docs) and the document frequency in your background set (your corpus).
Solr won't do that out of the box, but you can work on it.
Elasticsearch has a module for this that you can draw inspiration from [2].
[1] https://issues.apache.org/jira/browse/SOLR-9851
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
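A sketch of the comparison described above (this is not Solr API code, and the numbers are invented): a term is significant when its document-frequency rate in the foreground subset is much higher than in the background corpus.

```python
def significance(fg_df: int, fg_size: int, bg_df: int, bg_size: int) -> float:
    """Ratio of foreground DF rate to background DF rate."""
    fg_rate = fg_df / fg_size
    bg_rate = bg_df / bg_size
    return fg_rate / bg_rate if bg_rate else 0.0

# "apple" appears in 40 of 50 matching docs, but only 60 of 10000 overall,
# so it is heavily over-represented in the subset:
score = significance(40, 50, 60, 10000)
```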
All:
I wonder if there is any way we can use Lucene to do search-keyword relevancy discovery based on search history?
For example:
The code reads in a user's search string, parses it, extracts the keywords, and finds out which words are most likely to appear together in searches.
When I tried Solr, I found that Lucene has a lot of text-analysis features, which is why I am wondering if there is any way we can use it, combined with other machine-learning libs if necessary, to achieve my goal.
Thanks
Yes and No.
Yes.
It should work. Simply treat every keyword as a document and then use the MoreLikeThis feature of Lucene, which constructs a Lucene query on the fly based on terms within the raw query. The Lucene query is then used to find other similar documents (keywords) in the index.
MoreLikeThis mlt = new MoreLikeThis(reader);       // pass the index reader
mlt.setFieldNames(new String[] {"keywords"});      // specify the field for similarity
Query query = mlt.like(docID);                     // pass the doc id
TopDocs similarDocs = searcher.search(query, 20);  // use the searcher
if (similarDocs.totalHits == 0) {
    // handle the empty-result case
}
Suppose in your indexed keywords, you have such keywords as
iphone 6
apple iphone
iphone on sale
apple and fruit
apple and pear
When you launch a query with "iphone", I am sure you will find the first three keywords above are "most similar", due to the full term match of "iphone".
No.
The default similarity function in Lucene never understands that iphone is related to Apple Inc, and thus that iphone is related to "apple store". If your raw query is just "apple store", an ideal search result within your current keywords would be as follows (ordered by relevancy from high to low):
apple iphone
iphone 6
iphone on sale
Unfortunately, you will get the results below:
apple iphone
apple and fruit
apple and pear
The first one is great; however, the other two are totally unrelated. To get real relevancy discovery (using semantics), you need to do more work on topic modeling. If you happen to have a good way (e.g., a pre-trained LDA model or word2vec) to pre-process each keyword and produce a list of topic ids, you can store those topic ids in a separate field with each keyword document. Something like below:
[apple iphone] -> topic_iphone:1.0, topic_apple_inc:0.8
[apple and fruit] -> topic_apple_fruit:1.0
[apple and pear] -> topic_apple_fruit:0.99, topic_pear_fruit:0.98
where each keyword is also mapped to a few topic ids with weight value.
At query time, you should run the same topic modeling tool to generate topic ids for the raw query together with its terms. For example,
[apple store] -> topic_apple_inc:0.75, topic_shopping_store:0.6
Now you should combine the two fields (keyword and topic) to compute the overall similarity.
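As a sketch of that combination, using the hypothetical topic ids and weights from the examples above (the blending weight alpha is also a made-up choice):

```python
def topic_similarity(a: dict, b: dict) -> float:
    """Dot product over shared topic ids."""
    return sum(weight * b[topic] for topic, weight in a.items() if topic in b)

def combined_score(lexical: float, query_topics: dict,
                   doc_topics: dict, alpha: float = 0.5) -> float:
    """Blend the keyword-field score with the topic-field similarity."""
    return alpha * lexical + (1 - alpha) * topic_similarity(query_topics, doc_topics)

query_topics = {"topic_apple_inc": 0.75, "topic_shopping_store": 0.6}
iphone_doc = {"topic_iphone": 1.0, "topic_apple_inc": 0.8}   # [apple iphone]
fruit_doc = {"topic_apple_fruit": 1.0}                       # [apple and fruit]

# [apple iphone] shares topic_apple_inc with the query (0.75 * 0.8 = 0.6),
# while [apple and fruit] shares no topics, so the topic term pulls it down
# even when its lexical score is similar.
```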
Is it possible to make a proximity query in Solr while using the OR operator? Example:
term1:"Samsung"
term2:"galaxy note" OR "galaxy 3" OR ...
I need to find documents that have both term1 and term2 within ~n proximity.
This is not supported by default in Solr 4.4 which is currently the latest version.
SOLR-1604 has some relevant information.
The easy approach here would be to just generate a set of longer phrase queries. Rather than looking for "Samsung" and "galaxy note" separately, you concatenate them together, to give you something more like:
"Samsung galaxy note" OR "Samsung galaxy 3" ...
and setting it to have the desired amount of slop.
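A small Python sketch of that expansion step (the field-free syntax and the slop value of 4 are placeholders; adjust to your schema and desired proximity):

```python
def expand(root: str, alternatives: list, slop: int) -> str:
    """Concatenate the root term with each alternative into one phrase
    query per alternative, with slop, joined by OR."""
    return " OR ".join(f'"{root} {alt}"~{slop}' for alt in alternatives)

query = expand("Samsung", ["galaxy note", "galaxy 3"], 4)
# '"Samsung galaxy note"~4 OR "Samsung galaxy 3"~4'
```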
You can also use ComplexPhraseQueryParser, which handles similar cases, though I'm not sure it handles alternate phrases within a phrase. SurroundQueryParser also supports this sort of thing, but doesn't do any analysis, which means you'll have to analyze your subqueries beforehand (see the Limitations section in the documentation).
If you want to manually generate queries, you can support this more directly using a set of SpanQuerys. (Note: SpanMultiTermQueryWrapper only wraps multi-term queries such as wildcards, so the phrases are built directly from SpanTermQuery and SpanNearQuery here.) It would look something like:
SpanQuery rootQuery = new SpanTermQuery(new Term("name", "samsung"));
SpanQuery option1Query = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("name", "galaxy")),
        new SpanTermQuery(new Term("name", "note"))
    }, 0, true);  // exact phrase: zero slop, in order
SpanQuery option2Query = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("name", "galaxy")),
        new SpanTermQuery(new Term("name", "3"))
    }, 0, true);
SpanQuery orQuery = new SpanOrQuery(option1Query, option2Query);
SpanQuery finalQuery = new SpanNearQuery(
        new SpanQuery[] { rootQuery, orQuery }, slop, true);  // slop = your n
I'm using SolrNet to access a Solr index where I have a multivalue field called "tags". I want to perform the following pseudo-code query:
(tags:stack)^10 OR (tags:over)^5 OR (tags:flow)^2
where the term "stack" is being boosted by 10, "over" is being boosted by 5 and "flow" is being boosted by 2. The result I'm after is that results with "stack" will appear higher than those with "flow", etc.
The problem I'm having: say "flow" only appears in a couple of documents but "stack" appears in loads; then, due to a high idf value, documents with "flow" appear above those with "stack".
When this project was implemented directly in Lucene, I used ConstantScoreQuery, which eliminated the idf and based the score solely on the boost value.
How can this be achieved with Solr and SolrNet, where I'm effectively just passing Solr a query string? If it can't, is there an alternative way I can approach this problem?
Thanks in advance!
Solr 5.1 and later has this built into the query parser syntax via the ^= operator.
So just take your original query:
(tags:stack)^10 OR (tags:over)^5 OR (tags:flow)^2
And replace the ^ with ^= to change from boosted to constant:
(tags:stack)^=10 OR (tags:over)^=5 OR (tags:flow)^=2
I don't think there is any way to directly express a ConstantScoreQuery in Solr, but it seems that range and prefix queries use ConstantScoreQuery under the hood, so you could try faking a range query, e.g. tags:[flow TO flow]
Alternatively, you could implement your own Solr QueryParser.