Solr search: search for additional documents after an initial match

I have the following problem to solve.
The client sends the id of the document. This is an HTTP GET to a proxy (not directly to Solr). Example:
baseURL/movies/{id}
The response of this call will be a list of variants of this movie.
In order to find the variants I want to perform a SOLR search using title and some other fields, e.g.
/movies/select?q=title:spiderman+year:2001
I expect this to return the different variants of Spiderman, e.g. SpiderMan, Spiderman HD, etc.
The problem is that the proxy service does not have the title of the original movie; it receives only the movie's id from the API call.
My approach so far is to get the original movie information using the id,
e.g.
/movies/select?id={id}
Once I have the original movie, I perform a second Solr request to search for the variants.
Any ideas how to avoid the two calls to SOLR search?
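For reference, the two-request flow described above can be sketched as follows. This is only an illustration of the current approach, not a way to avoid the second call. The `id`, `title` and `year` field names and the `q=id:...` form are assumptions; the question itself only shows `/movies/select?id={id}`.

```python
from urllib.parse import urlencode

SOLR_SELECT = "/movies/select"  # path taken from the question


def lookup_by_id_url(movie_id: str) -> str:
    """First request: fetch the original movie by its id (field name assumed)."""
    return f"{SOLR_SELECT}?{urlencode({'q': f'id:{movie_id}'})}"


def variants_url(title: str, year: int) -> str:
    """Second request: search for variants using fields of the original movie."""
    query = f"title:{title} year:{year}"
    return f"{SOLR_SELECT}?{urlencode({'q': query})}"
```

The proxy would call `lookup_by_id_url` first, read the title and year from the returned document, and then call `variants_url` with those values.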

Related

Why doesn't dismax q.alt return any results?

I'm new to Solr.
After following tutorial exercise 1 (https://solr.apache.org/guide/8_9/solr-tutorial.html), I'm able to run some Solr queries on my local machine.
If I want to get results without any condition, I run a query like
http://127.0.0.1:8983/solr/#/techproducts/query?q=*:*&q.op=OR
This works fine.
But when I switch to "dismax" and try to get a similar result, I need to use "q.alt".
The query is
http://127.0.0.1:8983/solr/#/techproducts/query?q.op=OR&defType=dismax&q.alt=*:*
However, this query returns no results, which is pretty weird.
Even when I specify the row parameter, it still doesn't work.
http://127.0.0.1:8983/solr/#/techproducts/query?q.op=OR&defType=dismax&q.alt=*:*&row=0
Has anyone faced the same problem before?
These parameters are not meant to be used with the user interface URLs; they're for sending directly to Solr. The user interface is a JavaScript application that talks to the Solr API behind the scenes. You can see that your URLs have a local anchor in them (#); these are just references that the JavaScript-based user interface uses to load the correct page.
Also, the parameter is named rows, not row, and when it is set to 0 no documents are returned (in the tutorial, rows=0 appears in an example that uses facets with complete counts; you have to ask for facets for that to make sense).
The actual URL to query Solr for matching documents would be:
http://127.0.0.1:8983/solr/techproducts/select?defType=edismax&q.alt=*:*
This URL is shown in the user interface over the query results when using the query page.
There is also usually no reason to use dismax rather than edismax these days, as edismax does everything the old dismax handler did and offers more functionality.
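A minimal sketch of building that select URL programmatically, assuming the host and core name from the question. The helper name `select_url` is hypothetical, and the parameter is spelled `rows`:

```python
from urllib.parse import urlencode


def select_url(base: str, core: str, rows: int = 10) -> str:
    """Build a URL for Solr's /select handler (not the admin UI '#/' URL)."""
    params = {"defType": "edismax", "q.alt": "*:*", "rows": rows}
    # Keep '*' and ':' readable instead of percent-encoding them.
    return f"{base}/solr/{core}/select?{urlencode(params, safe='*:')}"


url = select_url("http://127.0.0.1:8983", "techproducts")
```

The resulting URL can be opened in a browser or fetched with any HTTP client.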

How to store this information in the Solr search engine?

I built a communicator, and now I want to add the Solr search engine to it.
Users create conversations, and every conversation contains one or more messages. Messages are stored as nodes in a tree. For example:
1. initial message
1.1 reply
1.2 another reply for initial message
1.2.1 bla bla bla...
1.2.2 Lorem ipsum dolorem...
1.3 third reply for initial message
There is always exactly one initial message.
I want to store the content of all messages in Solr. I'm thinking about storing the data this way:
{
"conversationId_s_lower": <conversation id here>,
"messageId_s_lower": <message id here>,
"content_txt_en": <message content here>
}
But I also need to index and search the properties of the conversation:
{
conversationTitle_txt_en: "...",
conversationAccessUsersId: [123, 45, ...],
....
}
So the question is: how should I index this data, and how should I make queries?
There are some questions to answer before you start designing. From Solr's perspective, you search for documents by giving a search term. So in your case, what do you consider a document to be: the conversation or an individual message? Usually a document corresponds to an entity, so here I would suppose a conversation, which has an ID.
Next, each conversation has multiple messages, and I can see there are multiple levels in the message hierarchy. Do you want to maintain that, or should all messages be treated as being on one level?
Then the querying part: when you search, do you expect messages or conversations to be returned and counted? That follows from how you design the entity above.
Once you have answered these questions, you can move on to denormalizing or to nested entities (in your case, messages nested under conversations). With those answers in hand, the rest of the process is covered by any Solr article on indexing documents. Let me know if you need any further information. Happy designing and coding!
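As one illustration of the denormalizing option, each message can be flattened into its own document that repeats the conversation-level fields. This is a sketch under assumptions: the shape of the input dict and the composite `id` key are hypothetical, while the Solr field names are the ones from the question.

```python
from typing import Dict, List


def message_docs(conversation: Dict) -> List[Dict]:
    """Flatten one conversation into one Solr document per message."""
    docs = []
    for message in conversation["messages"]:
        docs.append({
            # Assumed unique key: conversation id plus message id.
            "id": f"{conversation['id']}-{message['id']}",
            "conversationId_s_lower": conversation["id"],
            "messageId_s_lower": message["id"],
            "content_txt_en": message["content"],
            # Conversation properties are copied onto every message.
            "conversationTitle_txt_en": conversation["title"],
            "conversationAccessUsersId": conversation["access_user_ids"],
        })
    return docs
```

With this layout a single query can match on message content and filter on conversation properties, at the cost of re-indexing every message when a conversation property changes.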

Deleting solr documents from Solr Admin

How do I delete all the documents in my Solr index using Solr Admin?
I tried using the URL and it works, but I want to know whether the same can be done using the Admin UI.
Use one of the queries below in the Document tab of Solr Admin UI:
XML:
<delete><query>*:*</query></delete>
JSON:
{'delete': {'query': '*:*'}}
Make sure to set the Document Type drop-down to Solr Command (raw XML or JSON).
Update: newer versions of Solr may work better with this answer: https://stackoverflow.com/a/48007194/3692256
My original answer is below:
I'm cheating a little, but not as much as writing the query by hand.
Since I've experienced the pain of accidental deletions before, I try to foolproof my deletions as much as possible (in any kind of data store).
1) Run a query in the Solr Admin Query screen, by only using the "q" parameter at the top left. Narrow it to the items you actually want to delete. For this example, I'm using *:*, but you can use things like id:abcdef or a range or whatever. If you have a crazy complex query, you may find it easier to do this multiple times, once for each part of the data you wish to delete.
2) On top of the results, there is a grayed out URL. If you hover the mouse over it, it turns black. This is the URL that was used to get the results. Right (context) click on it and open it in a new tab/window. You should get something like:
http://localhost:8983/solr/my_core_name/select?q=*%3A*&wt=json&indent=true
Now, I want to get it into a delete format. I replace the select?q= with update?commit=true&stream.body=<delete><query> and, at the end, the &wt=json&indent=true with </query></delete>.
So I end up with:
http://localhost:8983/solr/my_core_name/update?commit=true&stream.body=<delete><query>*%3A*</query></delete>
Take a deep breath, do whatever you do for good luck, and submit the url (enter key works).
Now, you should be able to go back to the Solr admin page and run the original query and get zero results.
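The manual URL rewrite above can be sketched in code. This is an illustrative helper with a hypothetical name, not part of Solr; it assumes the server still accepts `stream.body`, as in the original answer.

```python
from urllib.parse import urlsplit, parse_qs, quote
from xml.sax.saxutils import escape


def select_to_delete_url(select_url: str) -> str:
    """Turn a /select?q=... URL into the matching delete-by-query /update URL."""
    parts = urlsplit(select_url)
    query = parse_qs(parts.query)["q"][0]  # decoded query, e.g. "*:*"
    update_path = parts.path.rsplit("/", 1)[0] + "/update"
    # Escape XML special characters so the query is safe inside <query>.
    body = f"<delete><query>{escape(query)}</query></delete>"
    return (f"{parts.scheme}://{parts.netloc}{update_path}"
            f"?commit=true&stream.body={quote(body)}")
```

As with the manual steps, run the select query first and check its results before submitting the generated URL.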
For everyone who doesn't like a lot of words :-)
curl http://localhost:8080/solr/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
curl http://localhost:8080/solr/update -H "Content-type: text/xml" --data-binary '<commit />'
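A rough Python equivalent of the two curl commands, using only the standard library. The base URL is the one from the curl example; the function names are hypothetical, and the requests are only built here, not sent.

```python
import urllib.request
from typing import List


def update_request(base_url: str, xml_body: str) -> urllib.request.Request:
    """Build a POST to Solr's /update handler with an XML body."""
    return urllib.request.Request(
        f"{base_url}/update",
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


def delete_all_requests(base_url: str) -> List[urllib.request.Request]:
    """Delete every document, then commit, mirroring the two curl calls."""
    return [
        update_request(base_url, "<delete><query>*:*</query></delete>"),
        update_request(base_url, "<commit />"),
    ]

# To actually send them (this deletes the whole index):
# for req in delete_all_requests("http://localhost:8080/solr"):
#     urllib.request.urlopen(req)
```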
Select XML on the collection's Documents tab and submit the following:
<delete><query>*:*</query></delete>
This solution applies only if you are deleting all the documents in multiple collections, not for selective deletion.
I had the same scenario, where I needed to delete all the documents in multiple collections. There were close to 500k documents in each shard, and each collection had multiple shards. Updating and deleting the documents with a query was a big task, so I followed the process below.
I used the Solr API to get the details of all the collections:
http://<solrIP>:<port>/solr/admin/collections?action=clusterstatus&wt=json
This gives the details like name of collection, numShards, configname, router.field, maxShards, replicationFactor, etc.
I saved the output JSON with the above details in a file for future reference and took backups of all the collections whose documents I needed to delete, using the following API:
http://<solr-ip>:<port>/solr/admin/collections?action=BACKUP&name=myBackupName&collection=myCollectionName&location=/path/to/my/shared/drive
Next, I deleted all the collections whose documents I needed to remove, using the following:
http://<solr-ip>:<port>/solr/admin/collections?action=DELETE&name=collectionname
I then re-created all the collections using the details saved from the clusterstatus call and the following API:
http://<solr-ip>:<port>/solr/admin/collections?action=CREATE&name=collectionname&numShards=number&replicationFactor=number&maxShardsPerNode=number&collection.configName=configname&router.field=routerfield
I executed the above steps in a loop over all the collections, and it finished in seconds for around 100 collections with large amounts of data. As a bonus, I had backups of all the collections.
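The per-collection loop could look roughly like this sketch, which only builds the Collections API URLs. It uses `action=DELETE`, the action that removes a collection (`DELETEALIAS` only removes an alias). The function name, the backup naming scheme and the shape of the `details` dict are assumptions.

```python
from urllib.parse import urlencode


def rebuild_urls(base: str, name: str, details: dict, backup_location: str) -> list:
    """Return the BACKUP, DELETE and CREATE calls for one collection, in order."""
    admin = f"{base}/solr/admin/collections"
    steps = [
        {"action": "BACKUP", "name": f"{name}_backup",
         "collection": name, "location": backup_location},
        {"action": "DELETE", "name": name},
        {"action": "CREATE", "name": name,
         "numShards": details["numShards"],
         "replicationFactor": details["replicationFactor"],
         "collection.configName": details["configName"]},
    ]
    return [f"{admin}?{urlencode(step)}" for step in steps]
```

Calling this once per collection saved from the clusterstatus output, and requesting each URL in order, reproduces the steps above.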
Refer to the Solr Collections API documentation for the other actions.
Under the Documents tab, select "raw XML or JSON" under Document Type and just add the query you need using the unique identifiers for each document.
{'delete': {'query': 'filter(product_id:(25634 25635 25636))'}}
If you want to delete some documents by ID, you can use the Solr post tool.
./post -c $core_name ./delete.xml
Where the delete.xml file contains the document IDs:
<delete>
<id>a3f04b50-5eea-4e26-a6ac-205397df7957</id>
</delete>
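If the IDs come from a program rather than a hand-written file, the same delete.xml content can be generated. This is a small sketch using the standard library; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def delete_by_ids_xml(ids: Iterable[str]) -> str:
    """Build a <delete> command with one <id> element per document."""
    root = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(root, "id").text = doc_id
    return ET.tostring(root, encoding="unicode")


xml_body = delete_by_ids_xml(["a3f04b50-5eea-4e26-a6ac-205397df7957"])
```

The returned string can be written to delete.xml and passed to the post tool as shown above.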

Searching wiki URLs using Solr

I am trying to index and search a wiki on our intranet using Solr. I have it more-or-less working using edismax but I'm having trouble getting main topic pages to show up first in the search results. For example, suppose I have some URLs in the database:
http://whizbang.com/wiki/Foo/Bar
http://whizbang.com/wiki/Foo/Bar/One
http://whizbang.com/wiki/Foo/Bar/Two
http://whizbang.com/wiki/Foo/Bar/Two/Two_point_one
I would like to be able to search for "foo bar" and have the first link returned as the top result because it is the main page for that particular topic in the wiki. I've tried boosting the title and URL field in the search but the fieldNorm value for the document keeps affecting the scores such that sub-pages score higher. In one particular case, the main topic page shows up on the 2nd results page.
Is there a way to make the first URL score significantly higher than the sub-pages so that it shows up in the top 5 search results?
One possible approach to try:
Create a copyField with your url
Extract path only (so, no host, no wiki)
Split on / and maybe space
Lowercase
Boost on phrase or bigram or something similar.
If you have a lot of levels, maybe you want a multivalued field, with different depth (starting from the end) getting separate entries. That way a perfect match will get better value. Here, you should start experimenting with your real searches.
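To see what such a field would contain, the steps above can be mimicked outside Solr. This sketch only illustrates the token output; in Solr itself this would be done by the field's analysis chain. The `/wiki/` prefix comes from the example URLs, and the suffix entries are one possible reading of "different depth (starting from the end)".

```python
import re
from typing import List
from urllib.parse import urlsplit


def path_tokens(url: str, prefix: str = "/wiki/") -> List[str]:
    """Drop host and wiki prefix, split on '/' and whitespace, lowercase."""
    path = urlsplit(url).path
    if path.startswith(prefix):
        path = path[len(prefix):]
    return [token.lower() for token in re.split(r"[/\s]+", path) if token]


def path_suffixes(tokens: List[str]) -> List[str]:
    """One entry per depth, starting from the end of the path."""
    return [" ".join(tokens[i:]) for i in range(len(tokens) - 1, -1, -1)]
```

For `http://whizbang.com/wiki/Foo/Bar` the suffix entries include the exact phrase "foo bar", while the sub-pages only contain it as part of longer entries, which is what a phrase boost could exploit.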

Get facet results only in Solr

I am trying to search my database using Solr, and I need to build one facet for the date of the articles (2011-6-12, 2011-7-1, etc.) and another for the category (sport, news, etc.). I built my PHP code using apache_solr_service and everything is fine so far: I can search the data in the database. But I want to use facets to filter the articles that were created on a specific date or that belong to a specific category.
I used:
http://localhost:8888/solr/collection1/select?facet=true&facet.query=datecreated:2011-6-21&facet.mincount=1&wt=json&json.nl=map&q=ruba&start=0&rows=10
It returned all the articles that contain the word 'ruba' and gave me the count of articles created on 2011-6-21.
What I need is to get only the articles that contain the word 'ruba' AND were created on 2011-6-21; I want only the faceted results to be returned.
Try using a filter query, fq=datecreated:2011-6-21, instead of the facet.
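A sketch of the request with the filter query in place of the facet parameters, reusing the host, core and field names from the question. The helper name is hypothetical, and the date value is kept exactly as the asker wrote it.

```python
from urllib.parse import urlencode


def filtered_search_url(base: str, term: str, field: str, value: str,
                        start: int = 0, rows: int = 10) -> str:
    """Search for `term`, restricted by an fq clause to `field:value`."""
    params = {"q": term, "fq": f"{field}:{value}",
              "wt": "json", "start": start, "rows": rows}
    return f"{base}/select?{urlencode(params, safe=':')}"


url = filtered_search_url("http://localhost:8888/solr/collection1",
                          "ruba", "datecreated", "2011-6-21")
```

Only documents matching both `q` and `fq` are returned; facet parameters can still be added if the counts are wanted as well.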
