dynamicField in django-haystack SOLR config - solr

How do I configure search_indexes.py to index dynamicFields in django-haystack? I'm using Solr as the search engine for Haystack.

If you are using Haystack's build_solr_schema management command to create your schema.xml, notice that it automatically includes various dynamicFields for popular field types. For example, check out the schema template for Haystack v2.1. (This looks like it's been there since Haystack v1.)
This allows you to create dynamically-named fields in your search index's prepare method. For example, if you were indexing notes that could have an id string for your ever-changing group of partners, you could do this:
def prepare(self, obj):
    self.prepared_data = super(NoteIndex, self).prepare(obj)
    for (partner_name, partner_id) in get_partners():
        self.prepared_data['%s_s' % partner_name] = partner_id
    return self.prepared_data
The key thing here is that the field name ends with "_s", which according to the schema is a dynamic name for string types.
Unfortunately, these dynamic partner fields are not explicitly defined at the top of your SearchIndex class. You may want to mention this in a comment.

As far as I can see in the source code of django-haystack 1.2.*, you can't do this. Instead of generating the schema with the management command, you can write your own schema and use that.

As @nofinator says, you can do this in the prepare method of the SearchIndex class by concatenating the field name with the dynamic prefix or suffix defined in the Solr schema.xml.
By default Haystack (the current version is 2.1.1) ships with some default dynamicFields such as *_s. But if you want, you can define your own dynamicField.
In my project I created an attr_* field and it works fine.
All you need to do is add this field, with the same syntax, to schema.xml.
You can do this manually or by overriding the standard Haystack build_solr_schema management command. (By the way, it uses the standard Django template rendering function, so it's pretty easy.)
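For illustration, here is a minimal sketch of the naming side of this, assuming a dynamicField along the lines of <dynamicField name="attr_*" type="string" indexed="true" stored="true"/> has been added to schema.xml. The helper name attr_fields and the sample attribute names are hypothetical, not part of Haystack.

```python
def attr_fields(attributes, prefix="attr_"):
    """Map arbitrary attribute names to dynamic Solr field names.

    Assumes schema.xml declares a dynamicField matching `prefix + "*"`.
    """
    return {"%s%s" % (prefix, name): value for name, value in attributes.items()}


# Hypothetical attributes collected for one object.
fields = attr_fields({"color": "red", "size": 3})
print(fields)
```

Inside a SearchIndex.prepare method the result could be merged with self.prepared_data.update(...), in the same way as the *_s example above.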

Related

Azure suggester returning all content

I'm trying to implement an Azure suggester feature into our pilot Azure search app and running into issues. The content I'm indexing are PDF files, so my suggester definition is based on the content field itself which can be thousands of lines of text. Following examples online, when I implement the suggester, I'm returned the entire content of the body of text from the PDF file. What I'd really like to do is return just a phrase found in the text.
For instance, suppose I'm indexing a Harry Potter book and I type into my search field "Dum", I'd like to see suggested results back like "Dumbledore", "Dementor", etc VS the whole book. Is this possible?
Tks
If we want to search for words sharing the same prefix, Autocomplete is the right API for this job. https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
In contrast, the Suggester API helps users find the documents containing words with that prefix. It returns text snippets containing those words.
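As a rough sketch, an Autocomplete request URL could be assembled like this. The service name, index name and suggester name below are placeholders, and the api-version value is an assumption you should check against the version your service supports. The request would also need an api-key header, which is not shown.

```python
from urllib.parse import urlencode


def autocomplete_url(service, index, suggester, text, api_version="2020-06-30"):
    """Build a GET URL for the Autocomplete endpoint (nothing is sent here)."""
    query = urlencode({
        "api-version": api_version,
        "search": text,                # the partial term typed by the user
        "suggesterName": suggester,    # must match the suggester defined on the index
        "autocompleteMode": "oneTerm", # complete a single term
    })
    return "https://%s.search.windows.net/indexes/%s/docs/autocomplete?%s" % (
        service, index, query)


# Placeholder names, for illustration only.
url = autocomplete_url("demo-service", "books", "sg", "Dum")
print(url)
```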
If you still believe suggester api does not behave as expected and autocomplete is not suitable, let me know your source document, query and expected results.

How to query solr field for a substring

My use case:
I have a single-valued field called cqpath. It is a text field and has values that look something like the following:
1. "/content/domain/en/path/to/some/page"
2. "/content/domain/en/path/to/another/page"
3. "/content/domain/en-us/path/to/some/page"
4. "/content/domain/en-us/path/to/another/page"
I wanted to form a query that would return me 1. and 2. I'd been trying along the lines of writing:
cqpath: "/content/domain/en"
which turned out to be wrong, since it retrieves items 3. and 4. as well. Can anyone think of a way to write a query that returns only 1. and 2. and not 3. and 4.?
This is a normal textfield field-type. Really do appreciate your help.
Starting from Solr 4.0 you can use a regex query. You can find some useful examples here.
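In your case, something like the following should return only 1. and 2., provided cqpath is indexed as a single untokenized term (a regex query is matched against whole indexed terms). The slash after en is what excludes the en-us paths, and slashes inside the pattern have to be escaped:
cqpath:/\/content\/domain\/en\/.*/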
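To see why the path boundary after en matters in a pattern like this, here is a small Python sketch. It uses Python's re module as a stand-in (Lucene's regex syntax differs in places), and fullmatch to mimic matching against a whole untokenized term. The variable names are made up for illustration.

```python
import re

paths = [
    "/content/domain/en/path/to/some/page",
    "/content/domain/en/path/to/another/page",
    "/content/domain/en-us/path/to/some/page",
    "/content/domain/en-us/path/to/another/page",
]

# No boundary after "en": "en-us" also satisfies "en.*".
loose = re.compile(r".*content/domain/en.*")
# Requiring "/" right after "en" pins the match to that path element.
strict = re.compile(r"/content/domain/en/.*")

loose_hits = [p for p in paths if loose.fullmatch(p)]
strict_hits = [p for p in paths if strict.fullmatch(p)]

print(len(loose_hits), len(strict_hits))
```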
It looks like you are trying to match partial paths here with boundaries on path elements (slashes). The usual generic solution is to tokenize during index to generate all alternative completions and not tokenize during query. So, the field type declaration is not symmetric. There are examples of that in Solr distribution. And you would look at using something like (index-time only) EdgeNGramFilterFactory instead of much more expensive regex matching.
For your specific case, you may want to look at testPathHierarchyTokenizer which does that for you automatically.
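The idea behind path hierarchy tokenization can be sketched in plain Python: at index time each path is expanded into all of its ancestor prefixes, and at query time the path is left whole, so the lookup is an exact term match. This is only an illustration of the principle, not Solr code, and the function names are hypothetical.

```python
def path_prefixes(path, delimiter="/"):
    """Return every ancestor prefix of `path`, shortest first."""
    parts = [p for p in path.split(delimiter) if p]
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


paths = [
    "/content/domain/en/path/to/some/page",
    "/content/domain/en/path/to/another/page",
    "/content/domain/en-us/path/to/some/page",
    "/content/domain/en-us/path/to/another/page",
]

# "Index time": store the set of prefixes for each document.
index = {p: set(path_prefixes(p)) for p in paths}


def search(prefix):
    """"Query time": the query is not tokenized, so this is an exact lookup."""
    return [p for p in paths if prefix in index[p]]


print(search("/content/domain/en"))
```

Because "/content/domain/en" and "/content/domain/en-us" are different tokens, the en-us paths are not returned.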
And if your content were more like full URLs than just paths, you might also be interested in a custom update request processor chain that includes URLClassify URP. It is not well documented, but it mentions generating URL parts, which is what I think you would want.

solr highlighting in old and new versions

I am migrating a web site from an old version of solr (1.4.1) to the current release version (5.2.1) on a different machine and noticing some differences.
In the old version, I could get highlighting with a url like this:
http://localhost:8983/solr/select?indent=on&q=text:software/&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.fragsize=200
In the new version, one thing that's different is I need to specify a collection. Another difference is that the new version gives an error if I put text: in front of the value of q.
So, taking into account those differences, I end up with a URL like this:
http://localhost:8983/solr/default/select?indent=on&q=software/&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.snippets=1&hl.fl=%2a&hl.fragsize=200
That second URL does not give me highlighting fragments/snippets. That is to say, where the old URL would give something like this:
"highlighting":{
"document0_id":{"text":["The <em>software</em> is awesome"]}}
The new URL gives something like this:
"highlighting":{
"document0_id":{}}
What do I need to do to get highlighting fragments returned in solr 5.2.1?
[edited]
In addition, I tried selecting a single document by its id on both machines. On the old machine, a url like
http://localhost:8983/solr/select?wt=json&indent=true&q=id:thedocumentid
returns some JSON that includes a text field containing the full searchable text of the original HTML document. On the new machine a similar url (but one that includes the collection):
http://localhost:8983/solr/default/select?wt=json&indent=true&q=id:thedocumentid
...returns similar JSON that does not include the text field.
I note that searching returns the correct results; the problem is that on the new machine, the results do not include the highlighting fragments.
So it seems like maybe the issue is that I need to specify that these documents have a text field when I index them; how do I do that?
A colleague (not tempted by the bounty) noticed that my text field had stored="false" in my schema.xml and suggested changing it to true. That did the trick.
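A quick way to spot this kind of problem is to scan schema.xml for fields that are explicitly marked as not stored. The sketch below does that with Python's standard library. The inline schema is a made-up fragment, and the check only looks at an explicit stored="false" on the field itself (values inherited from the fieldType are not resolved).

```python
import xml.etree.ElementTree as ET

# Made-up schema fragment, for illustration only.
SCHEMA = """
<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="false"/>
</schema>
"""


def unstored_fields(schema_xml):
    """Return names of <field> elements with an explicit stored="false"."""
    root = ET.fromstring(schema_xml)
    return sorted(
        f.get("name")
        for f in root.iter("field")
        if f.get("stored", "").lower() == "false"
    )


print(unstored_fields(SCHEMA))
```

After changing stored to true, the documents need to be reindexed before the field shows up in results.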
In the first query you are specifically searching in the text field, and in the second you are not.
And in the second you have mentioned hl.fl which means "Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField"
Try again by making the changes...
http://localhost:8983/solr/default/select?q=text:software&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.fragsize=200

Generate highlighted snippets in Solr for PDFs

I'm new to solr. I've set up a solr server and have indexed a few thousand PDFs. I am trying to query solr via the rest API in a PHP page. I am trying to build something similar to the solritas interface included in the tutorial (solrserver/browse), but I don't know how to generate highlighted snippets. I found in the documentation "hl" is a query parameter and is by default set to false.
When I get http://solrserver/?q=search+term&hl=true I get back a response with a highlighting section, but it only contains the document IDs, no generated snippets.
I am using the tutorial provided schema and config for solr 4.2.1. I believe that the configuration is fine because solritas is able to display highlighted snippets using the same indexed data. I've tried seeing how solritas is built but it's separated out in .vm template files and I haven't been able to find what I'm looking for yet.
I can see the full text of the PDF in the doc->content area, so it is stored. I think I just don't understand the proper way to generate snippets! Can someone please help!
Thanks :)
I would suggest using the hl.fl parameter. Your query should look something like this:
?q=search+term&hl=true&hl.fl=field1,field2,field3
Where field1, field2 and field3 are the source fields you would like to generate highlights for.
In your case, if the field name you want to use for highlighting is content, your query can be:
?q=search+term&hl=true&hl.fl=content
More details: http://docs.lucidworks.com/display/solr/Highlighting
With highlighting, you can even specify fragment size, HTML tags around highlighted text etc...
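As a sketch of how these parameters fit together, the query string could be assembled like this in Python (the same idea applies in PHP). The helper name highlight_query and the default values are assumptions for illustration. hl.fragsize, hl.simple.pre and hl.simple.post are the parameters for fragment size and the tags wrapped around matches.

```python
from urllib.parse import urlencode


def highlight_query(terms, fields, fragsize=100, pre="<em>", post="</em>"):
    """Build a Solr query string that asks for highlighted snippets."""
    params = [
        ("q", terms),
        ("hl", "true"),
        ("hl.fl", ",".join(fields)),  # fields to generate snippets for
        ("hl.fragsize", fragsize),    # approximate snippet length in characters
        ("hl.simple.pre", pre),       # markup placed before each match
        ("hl.simple.post", post),     # markup placed after each match
    ]
    return urlencode(params)


query = highlight_query("search term", ["content"])
print("?" + query)
```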

Jackrabbit XPath Issue

I'm relatively new to Jackrabbit. In our application we never turned on the SearchIndex section in repository.xml (or in workspace.xml) because we always go directly to a given document using the JCR UUID reference. We are using Jackrabbit v2.2.1 with Oracle as the repository. Now our requirements are expanding: we would like to use the document metadata feature to store contextual info about a document so that we can use the metadata to retrieve a selected set of documents.
As the first step, I added the default SearchIndex section in workspace.xml file and restarted the JCR.
I saw a bunch of lines like this in my log file - then I saw it created the index folder under workspace area.
2011-07-05 15:04:01.724 INFO [WebContainer : 0] MultiIndex.java:1204 indexing... /vfs:metaData/21ee130e-978e-415f-bfd1-7aa03d91608c/vfs:attributes (3500)
I have a folder structure like this. When I create a document in JCR, I specify the metadata info as part of the document, which is defined by a complex XSD type with tags like docType, uploadedBy, contextValue, etc.
/ (root)
  /MyApp (sub-folder)
    /documents/ (sub-folder)
      /document-1.pdf (file)
      /document-2.pdf (file)
    /accounts/ (sub-folder)
      /account.txt (file)
etc...
The following XPath expression works.
//jcr:root/vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
If I give wrong value, for example instead of 'TAX_DOCS', 'TAX', it returns no documents as expected which is great. This proves that the metadata is correctly stored as expected and it is used in the filter process correctly.
The problem with this query is that it starts searching from the root folder but I want to search from /MyApp/documents sub-folder only. So I tried this:
//jcr:root/MyApp/documents//vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
It returns nothing. Then I tried this too but no success.
//jcr:root/MyApp/documents//*[vfs:metaData/vfs:attributes/vfs:docType='TAX_DOCS']
So what am I doing wrong? Is there anything in the workspace.xml configuration that is missing or that we need to set?
Any help is appreciated.
Thanks, Jack
Drop the double slashes from everything but the last path component and use the @ notation for the attribute value, resulting in:
/jcr:root/MyApp/documents//*[vfs:attributes/@vfs:docType='TAX_DOCS']
The // construct looks for the whole subtree instead of just the immediate children like / does. The JCR specification only requires implementations to support the // construct as the last step of the XPath query.
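The difference between the two steps can be tried out on a small XML tree with Python's standard library. This is only an analogy: ElementTree implements a limited XPath subset and knows nothing about JCR, and the element and attribute names below are simplified, made-up stand-ins for the repository structure.

```python
import xml.etree.ElementTree as ET

# Made-up tree that loosely mirrors /MyApp/documents, for illustration only.
TREE = ET.fromstring("""
<root>
  <MyApp>
    <documents>
      <file name="document-1.pdf">
        <attributes docType="TAX_DOCS"/>
      </file>
      <file name="document-2.pdf">
        <attributes docType="OTHER"/>
      </file>
    </documents>
  </MyApp>
</root>
""")

# "/" step: only the immediate children of documents.
children = TREE.findall("MyApp/documents/*")

# "//" step: every node in the subtree below documents.
subtree = TREE.findall("MyApp/documents//*")

# Attribute test written with "@".
tax = TREE.findall("MyApp/documents//*[@docType='TAX_DOCS']")

print(len(children), len(subtree), [e.tag for e in tax])
```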

Resources