I built a communicator, and now I want to add the Solr search engine to it.
Users create conversations, and every conversation contains one or more messages. Messages are stored as nodes in a tree. For example:
1. initial message
1.1 reply
1.2 another reply for initial message
1.2.1 bla bla bla...
1.2.2 Lorem ipsum dolorem...
1.3 third reply for initial message
There is always exactly one initial message.
I want to store the content of all messages in Solr. I'm thinking about storing the data this way:
{
"conversationId_s_lower": <conversation id here>,
"messageId_s_lower": <message id here>,
"content_txt_en": <message content here>
}
But I also need to index and search the properties of the conversation:
{
conversationTitle_txt_en: "...",
conversationAccessUsersId: [123, 45, ...],
....
}
So the question is: how should I index this data, and how should I make queries?
Some questions to ask before you start designing. From a Solr perspective, you search for documents by giving a search term. So in your case, what do you consider a document to be: the conversation or an individual message? Mostly a document is analogous to an entity, so here I suppose a conversation. So it has an ID.
Next, each conversation has multiple messages, and I can see there are multiple levels to this message hierarchy. Do you want to maintain that, or are all messages considered to be on one level?
Then the querying part: when you search, are you expecting message counts or conversation counts? This is decided anyway when you design the entity as above.
Once you answer these questions, you can move on to denormalizing or nested entities (in your case, messages are nested under conversations). With answers to the above, the rest of the process can be found in any Solr article on indexing documents. Let me know if you need any further information. Happy designing and coding!
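To make the nested-entities option concrete, here is a minimal sketch of a parent/child document layout. The field names follow the question's examples; the `type_s` discriminator field, the ids, and the sample values are my own illustrative assumptions:

```json
{
  "id": "conv-1",
  "type_s": "conversation",
  "conversationTitle_txt_en": "Trip planning",
  "conversationAccessUsersId": [123, 45],
  "_childDocuments_": [
    { "id": "conv-1-msg-1", "type_s": "message", "content_txt_en": "initial message" },
    { "id": "conv-1-msg-2", "type_s": "message", "content_txt_en": "reply" }
  ]
}
```

With a layout like this, a block-join query such as `q={!parent which="type_s:conversation"}content_txt_en:trip` would return the conversations whose messages match the term.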
I want to manipulate a doc and change the token value for one or more fields by prepending some value to each token. I am doing bulk updates through DIH and also posting documents through SolrJ. I have a replication factor of 2, so replication should also work. The value that I want to prepend exists in the document as a separate field. I would like to know where I can intercept the document before indexing so that I can manipulate it. One option I can think of is overriding DirectUpdateHandler2. Is this the right place?
I could do it by processing the document externally and passing it to Solr, but I want to do it inside Solr.
Document fields are :
city:mumbai
RestaurantName:Talk About
Keywords:Cofee, Chines, South Indian, Bar
I want to index keywords as
mumbai_cofee
mumbai_Chines
mumbai_South Indian
mumbai_Bar
The right place is an Update Request Processor. Make sure you plug it into solrconfig.xml for all update handlers you are using (including DIH), and the single URP will cover all updates.
In your Java code in the URP you can easily get the value of a field and then prepend it to the values of another field, etc. This happens before the doc is indexed.
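As a sketch of the wiring, the chain is declared in solrconfig.xml and attached to the update handler via `update.chain`. The factory class name `com.example.PrependCityProcessorFactory` is a hypothetical custom URP, not a stock Solr class:

```xml
<updateRequestProcessorChain name="prepend-city">
  <!-- hypothetical custom URP that prepends the "city" value to each "Keywords" token -->
  <processor class="com.example.PrependCityProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">prepend-city</str>
  </lst>
</requestHandler>
```

The same `update.chain` default would also need to be set on the DIH request handler so that imports go through the chain as well.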
I have the following problem to solve.
Client sends the id of the document. This is an HTTP GET to a proxy (not directly to Solr). Example:
baseURL/movies/{id}
The response of this call will be a list of variants of this movie.
In order to find the variants I want to perform a SOLR search using title and some other fields, e.g.
/movies/select?q=title:spiderman+year:2001
The response will contain the different variants of Spiderman, e.g. SpiderMan, Spiderman HD, etc.
The problem I have now is that the proxy service will not have the title of the original movie; it only gets the movie's id from the API.
My approach so far is to get the original movie information using the id,
e.g.
/movies/select?q=id:{id}
After I get the original movie then I perform a second request to SOLR search for the variants.
Any ideas how to avoid the two calls to SOLR search?
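If the variants can be linked by a shared key, for example a hypothetical `series_id` field that all variants of a movie carry, a single request with Solr's join query parser could look up the original document and match its siblings in one round trip. A sketch, assuming such a field exists in the schema:

```
/movies/select?q={!join from=series_id to=series_id}id:{id}&fq=-id:{id}
```

The inner query `id:{id}` selects the original movie, the join matches every document sharing its `series_id`, and the `fq` excludes the original itself. Without such a shared field, the two-step lookup is the usual approach.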
How do I delete all the documents in my Solr index using the Solr Admin?
I tried using the URL and it works, but I want to know if the same can be done using the Admin.
Use one of the queries below in the Document tab of Solr Admin UI:
XML:
<delete><query>*:*</query></delete>
JSON:
{'delete': {'query': '*:*'}}
Make sure to set the Document Type drop-down to Solr Command (raw XML or JSON).
Update: newer versions of Solr may work better with this answer: https://stackoverflow.com/a/48007194/3692256
My original answer is below:
I'm cheating a little, but not as much as writing the query by hand.
Since I've experienced the pain of accidental deletions before, I try to foolproof my deletions as much as possible (in any kind of data store).
1) Run a query in the Solr Admin Query screen, by only using the "q" parameter at the top left. Narrow it to the items you actually want to delete. For this example, I'm using *:*, but you can use things like id:abcdef or a range or whatever. If you have a crazy complex query, you may find it easier to do this multiple times, once for each part of the data you wish to delete.
2) On top of the results, there is a grayed out URL. If you hover the mouse over it, it turns black. This is the URL that was used to get the results. Right (context) click on it and open it in a new tab/window. You should get something like:
http://localhost:8983/solr/my_core_name/select?q=*%3A*&wt=json&indent=true
Now, I want to get it into a delete format. I replace the select?q= with update?commit=true&stream.body=<delete><query> and, at the end, the &wt=json&indent=true with </query></delete>.
So I end up with:
http://localhost:8983/solr/my_core_name/update?commit=true&stream.body=<delete><query>*%3A*</query></delete>
Take a deep breath, do whatever you do for good luck, and submit the url (enter key works).
Now, you should be able to go back to the Solr admin page and run the original query and get zero results.
For everyone who doesn't like a lot of words :-)
curl http://localhost:8080/solr/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
curl http://localhost:8080/solr/update -H "Content-type: text/xml" --data-binary '<commit />'
Select XML in the collection's Document tab and submit the query below:
<delete><query>*:*</query></delete>
This solution is only applicable if you are deleting all the documents in multiple collections, not for selective deletion:
I had the same scenario, where I needed to delete all the documents in multiple collections. There were close to 500k documents in each shard, and there were multiple shards per collection. Deleting the documents with a delete-by-query was a big task, so I followed the process below:
Used the Solr API for getting the details for all the collections -
http://<solrIP>:<port>/solr/admin/collections?action=clusterstatus&wt=json
This gives the details like name of collection, numShards, configname, router.field, maxShards, replicationFactor, etc.
Saved the output JSON with the above details in a file for future reference, and took backups of all the collections whose documents I needed to delete, using the following API:
http://<solr-ip>:<port>/solr/admin/collections?action=BACKUP&name=myBackupName&collection=myCollectionName&location=/path/to/my/shared/drive
Then I deleted each of the collections whose documents I needed to remove, using the following (note that the action for deleting a collection is DELETE; DELETEALIAS only removes a collection alias):
http://<solr-ip>:<port>/solr/admin/collections?action=DELETE&name=collectionname
Re-created all the collections using the details from Step 1 and the following API:
http://<solr-ip>:<port>/solr/admin/collections?action=CREATE&name=collectionname&numShards=number&replicationFactor=number&maxShardsPerNode=number&collection.configName=configname&router.field=routerfield
I executed the above steps in a loop for all the collections and was done in seconds for around 100 collections with huge amounts of data. Plus, I had the backups for all the collections as well.
Refer to the Solr Collections API documentation for the other APIs used here (BACKUP, DELETE, CREATE).
Under the Documents tab, select "raw XML or JSON" under Document Type and just add the query you need using the unique identifiers for each document.
{'delete': {'query': 'filter(product_id:(25634 25635 25636))'}}
If you want to delete some documents by ID, you can use the Solr post tool.
./post -c $core_name ./delete.xml
Where the delete.xml file contains the document ids:
<delete>
<id>a3f04b50-5eea-4e26-a6ac-205397df7957</id>
</delete>
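When the id list is long, the delete.xml body can be generated instead of written by hand. A small sketch; the helper name and the second id are made up for illustration:

```python
# Build a delete.xml body for the Solr post tool from a list of document ids.
import xml.etree.ElementTree as ET

def build_delete_xml(ids):
    root = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(root, "id").text = doc_id
    return ET.tostring(root, encoding="unicode")

# The second id is a made-up example.
body = build_delete_xml(["a3f04b50-5eea-4e26-a6ac-205397df7957", "0000-example"])
print(body)
# <delete><id>a3f04b50-5eea-4e26-a6ac-205397df7957</id><id>0000-example</id></delete>
```

Write the result to delete.xml and run `./post -c $core_name ./delete.xml` as above.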
I am completely new to Solr and have the task of carrying on the development of our new search engine, because my colleague is not here anymore.
My problem:
I want to get faceted (hierarchical) categories with item counts.
example
Search for 'Galaxy'
Items found: 123
Shown categories:
Electronics (83)
    Mobiles (60)
        Tablets (23)
        Smartphones (37)
    .....
Books (40)
....
My category fields (in Solr) for each article contain several category trees, separated by commas.
e.g.:
"categories_raw": "Electronics/Mobiles/Tablets,Books/MobilePhones"
A query sent to my Solr with the following parameters returns facet_fields with item counts, but only counts for the full leaf paths, not for the parent categories:
q=samsung&q.alt=samsung&...&facet=true&facet.field=categories_raw&facet.mincount=1
results (at the end of resulting JSON):
"facet_counts": {
"facet_queries": {},
"facet_fields": {
"categories_raw": [
"Electronics/Mobiles/Smartphones",
37,
"Books/MobilePhones",
20,
....
How can I get a count for each category level, as in my example at the top?
Is it possible to have Solr break down the hierarchical category string in the field "categories_raw"? Did I miss something?
Hope someone can help ;) Thanks!
Solr counts what you give it. Try using multivalued fields and putting the values you want to count in there. So:
Electronics/Mobiles/Tablets
Electronics/Mobiles
Electronics
Then, the facets will give you counts for each of those. To create those, you can either do them on client side or look at writing a custom Update Request Processor to do it when the entry is created (before indexing).
Solr also has some support for extracting hierarchical information from a structured path (e.g. via PathHierarchyTokenizerFactory).
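The client-side variant of the approach above can be as small as a helper that expands each raw path into all of its ancestor prefixes before sending the document to Solr. A sketch; the function name is mine:

```python
# Expand a raw category string like "Electronics/Mobiles/Tablets,Books/MobilePhones"
# into every ancestor path, so faceting on the resulting multivalued field
# counts an item once at each level of its hierarchy.
def expand_category_paths(raw):
    paths = set()
    for tree in raw.split(","):
        parts = tree.strip().split("/")
        for i in range(1, len(parts) + 1):
            paths.add("/".join(parts[:i]))
    return sorted(paths)

print(expand_category_paths("Electronics/Mobiles/Tablets,Books/MobilePhones"))
# ['Books', 'Books/MobilePhones', 'Electronics', 'Electronics/Mobiles', 'Electronics/Mobiles/Tablets']
```

Indexing these values into a multivalued string field and faceting on it yields the per-level counts shown in the question.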
Imagine an index like the following:
id partno name description
1 1000.001 Apple iPod iPod by Apple
2 1000.123 Apple iPhone The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain doesn't do this. If you absolutely need it, I'd try the multiple-requests solution and benchmark it; Solr tends to be a lot faster than what people put in front of it, so a couple of extra requests might not be that big of a deal.
You could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You could also implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain, so you only need a single query from the frontend.
If you want the term to be searched over the same fields every time, you have 2 options that don't break the "single query" requirement:
1) copyField: you group at index time all the fields that should match together. With just one copyField your problem doesn't exist; if you need more than one, you're back at the same spot.
2) you could filter the query each time by dynamically adding the "fq" parameter at the end:
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
this works if you'll always be searching on the same two fields (or you can set them up before querying); otherwise you'll always need at least a second query
Since I said "you have 2 options" but you actually have 3 (and I rushed my answer), here's the third:
3) the DisMax plugin, described in the documentation like this:
The DisMaxQParserPlugin is designed to process simple user-entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
So, if you can use it, you may want to give it a look and start from the qf parameter (that is what option number 2 was meant to be about, but I changed it in favor of fq... don't ask me why...)
Solr faceting should solve your problem. Have a look at the examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
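For completeness, the request above can be assembled programmatically. A small sketch (the helper name is mine) that builds one facet.query per field, so a single request returns both the matching documents and the per-field hit counts:

```python
# Build the query string for a field-disjunctive search with per-field
# facet.query counts, mirroring the request shown above.
from urllib.parse import urlencode

def field_facet_params(term, fields):
    params = [
        ("q", term),
        ("defType", "edismax"),
        ("qf", " ".join(fields)),
        ("facet", "true"),
    ]
    # One facet.query per field yields the per-field hit counts.
    params += [("facet.query", f"{f}:{term}") for f in fields]
    return urlencode(params)

print(field_facet_params("donkey", ["title", "text"]))
# q=donkey&defType=edismax&qf=title+text&facet=true&facet.query=title%3Adonkey&facet.query=text%3Adonkey
```

Appending the result to `/select?` reproduces the combined query, and the "Filter by field" section can be rendered from the facet_queries counts in the response.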