Solr - Bringing back snippets from indexed data

I have a Solr/Lucene setup where I have indexed a set of documents (MS Word files) and can happily search the content of these documents. However, I would like to return a snippet from within the content of the document that shows where the matching line is (+/- 5 words from the match term). I have tried to follow a range of Google hits, but my index does not seem to give direct access to the "content".
Can anyone give me some basic and simple pointers to where I might have made any errors on this - I have based all my work so far on the guidance and examples of the Solr Reference Guide - so I am not sure if the issue is in the search parameters or the original index.
I am doing this to create a clear set of user requirements for building an end solution rather than creating the end solution myself, so I am no expert on the tools and do not need to become one, just need to evidence what is possible with this tool set.

As MatsLindh noted above, the issue was that the config was not pulling the actual content of the Tika parse across into a specific field, so there was no full text content to display and highlight.
To resolve this I followed the link (https://lucene.apache.org/solr/guide/7_1/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler) to the guidance documents, reviewed the part on fmap, and used the example given for Last Modified Date as a guide on what to apply.
I then went to my solrconfig.xml file in the relevant core folder and added in the following line in the code beneath an already present fmap entry:
<str name="fmap.content">testcontent</str>
I had previously set up the testcontent field under the Solr web interface in my core. I then re-ran my indexing line via a command prompt and that seemed to do the trick in terms of pulling out the basic content and wrapping it with a basic emphasis.
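For reference, this is a minimal SolrJ sketch of the kind of highlighting query that pulls those snippets back; the core name (mycore) and the search term are placeholders I have assumed, while testcontent is the field set up above.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightDemo {
    public static void main(String[] args) throws Exception {
        // Assumed core name; point this at the core that was indexed above.
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        SolrQuery q = new SolrQuery("testcontent:report");   // "report" is a placeholder search term
        q.setHighlight(true);
        q.set("hl.fl", "testcontent");        // highlight on the field populated via fmap.content
        q.set("hl.fragsize", 100);            // approximate snippet length in characters
        q.set("hl.simple.pre", "<em>");       // the basic emphasis wrapper
        q.set("hl.simple.post", "</em>");
        QueryResponse resp = solr.query(q);
        System.out.println(resp.getHighlighting());   // doc id -> field -> snippets
        solr.close();
    }
}

The same hl, hl.fl, hl.fragsize and hl.simple.pre/post parameters can equally be appended to a plain /select request URL, which may be easier when just evidencing what is possible.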
Thanks all for the input on this - still a lot more I want to test to help develop a clear requirement set, but this really helps prove some of the basics are not complicated.

Related

Azure suggester returning all content

I'm trying to implement an Azure suggester feature in our pilot Azure Search app and running into issues. The content I'm indexing is PDF files, so my suggester definition is based on the content field itself, which can be thousands of lines of text. Following examples online, when I implement the suggester I'm returned the entire body of text from the PDF file. What I'd really like to do is return just a phrase found in the text.
For instance, suppose I'm indexing a Harry Potter book and I type "Dum" into my search field; I'd like to see suggested results back like "Dumbledore", "Dementor", etc. vs. the whole book. Is this possible?
Tks
If we want to search for words sharing the same prefix, Autocomplete is the right API for this job. https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
In contrast, the Suggester API helps users find the documents containing words with that prefix. It returns text snippets containing those words.
If you still believe the Suggester API does not behave as expected and Autocomplete is not suitable, let me know your source document, query and expected results.
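To illustrate the difference, here is a rough sketch of calling the Autocomplete REST endpoint from Java; the service name, index name, suggester name and api-version are assumptions you would replace with your own values.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AutocompleteDemo {
    public static void main(String[] args) throws Exception {
        // myservice, books and sg are placeholders for your own service, index and suggester names.
        String url = "https://myservice.search.windows.net/indexes/books/docs/autocomplete"
                + "?api-version=2019-05-06&search=Dum&suggesterName=sg";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("api-key", System.getenv("AZURE_SEARCH_KEY"))   // query or admin key from the portal
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Expect short completed terms such as "dumbledore" rather than the whole document body.
        System.out.println(response.body());
    }
}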

GAE Full Text Search: can only match exact word? how to search like contains(...)?

Just tried GAE (1.7.7 Java) Full Text Search and found that if the search string is "work", surprisingly it will not match "working", "worked", "hardworking" or "homework". I'd like to know if I am missing something in the API; I read the tutorial but did not find any documentation about this apart from plural matching.
Thanks.
P.S. I tried this with a unit test for the search service, not in a working environment.
Tucked away in the docs (but unfortunately not in the table of operators), there is a '~' operator:
To search for plural variants of an exact query, use the ~ operator:
~"car" # searches for "car" and "cars"
Not sure how far that will get you. Unfortunately that's about it.
See https://developers.google.com/appengine/docs/java/search/overview#Queries_on_Fields
There is very little documentation on this, but having just tried it, it only works on plurals.
One approach would be to do your own stemming on the words in the document (though you wouldn't return that as the display text ;-). Then you could perform stemming on your search term and be able to match worked, working, etc.
This is a late answer, but to follow up the previous answer, what you want to do is not possible with the basic API functions. The search API works on full-text searching principles. To get around this you can tokenise your searchable data pre-index and store this in a field with the relevant document.
See: Partial matching GAE search API
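As a rough illustration of that pre-index tokenisation idea, the sketch below generates prefix tokens for each word; storing the joined tokens in an extra text field means an exact-match query on "work" will hit a document whose original text only contains "working" or "worked". (It would not cover substring matches such as "homework"; those would need substring tokens as well.) The class and method names are just for illustration.

import java.util.LinkedHashSet;
import java.util.Set;

public class PrefixTokenizer {
    // Builds prefix tokens ("wor", "work", "worki", ...) for every word in the text.
    static String prefixTokens(String text, int minPrefixLength) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            for (int i = minPrefixLength; i <= word.length(); i++) {
                tokens.add(word.substring(0, i));
            }
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Store this alongside the original content in a separate searchable field.
        System.out.println(prefixTokens("working worked", 3));
        // -> wor work worki workin working worke worked
    }
}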

Skipfish Dictionaries

I'm trying to explore Skipfish by Google.
I went through their documentation, and also through the README-FIRST file (present in the dictionaries folder).
As far as I could understand, dictionaries are extremely useful for subsequent scans of the same target.
But what I haven't been able to understand so far is: how is this being achieved? What's the underlying mechanism that uses the dictionary, and in what way?
I'd really appreciate some help with this.
Thanks
The dictionary consists of words (such as 'index' and 'cgi-bin') and extensions (such as 'old' and 'php') that are combined to form filenames skipfish attempts to access as part of a dictionary attack. For example:
/some/path/index.old
/some/path/index.php
/some/path/cgi-bin.old
While crawling a site, skipfish can add new words to the dictionary as they are discovered in URLs and HTML.
Using an updated dictionary is beneficial because keywords discovered during the first scan are already in the dictionary, ready to be used for probing from the beginning of subsequent scans (instead of only becoming available once they are discovered within a scan).
The details of which words are combined with which extensions are discussed in the 'More about dictionary design' section towards the bottom of dictionaries/README-FIRST.
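For illustration only, the combination step the README describes amounts to something like the following toy sketch (the real tool applies extra rules about which keywords accept which extensions):

import java.util.List;

public class DictionaryCombiner {
    public static void main(String[] args) {
        // Toy keyword and extension lists taken from the example above.
        List<String> words = List.of("index", "cgi-bin");
        List<String> extensions = List.of("old", "php");
        for (String word : words) {
            for (String ext : extensions) {
                // Each combination becomes a candidate path for the scanner to probe.
                System.out.println("/some/path/" + word + "." + ext);
            }
        }
    }
}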
Note: I'm not a skipfish expert and you may get a better answer by posting in http://security.stackexchange.com
Also there is an article on skipfish at http://resources.infosecinstitute.com/skipfish-vulnerability-scanner/

Solr wildcard search, need to only highlight the matched part not the entire word

My text is like this: The searched word WildCard shall be partially highlighted
I search using the wildcard expression "wild*".
What I need is for the highlight snippet to be [tag]Wild[/tag]Card. What I got was [tag]WildCard[/tag], and I have spent lots of time researching it but could not find an answer.
This behavior can be seen on linkedin.com when you type another person's name into the search box at the top right corner.
Once this is figured out, I will have a follow-up questions.
Thanks,
I am not sure if you can achieve what you want directly in solr. The obvious solution is to parse the returned doc yourself searching for [tag]WildCard[/tag] and find out what part of the term you need to highlight.
This is not possible with Solr. What you need to do is change the characters Solr uses to highlight found words to something you can easily remove (maybe an HTML comment) and then build the highlight yourself. You are likely already storing your query in a variable, so just scan the returned docs and highlight all instances of your query.
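As a rough sketch of that post-processing step, the snippet below wraps only the typed prefix of each matching word in the stored field value; the [tag] markers and the prefix "Wild" mirror the example in the question, and the class name is just a placeholder.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartialHighlighter {
    // Wraps just the prefix of every word that starts with the user's wildcard prefix.
    static String highlightPrefix(String text, String prefix) {
        Pattern p = Pattern.compile("(?i)\\b(" + Pattern.quote(prefix) + ")");
        Matcher m = p.matcher(text);
        return m.replaceAll("[tag]$1[/tag]");
    }

    public static void main(String[] args) {
        String field = "The searched word WildCard shall be partially highlighted";
        System.out.println(highlightPrefix(field, "Wild"));
        // -> The searched word [tag]Wild[/tag]Card shall be partially highlighted
    }
}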

Jackrabbit XPath Issue

I'm relatively new to Jackrabbit. In our application we never turned on the SearchIndex section within the repository.xml (or workspace.xml) files because we always go directly to a given document using its JCR UUID reference. We are using Jackrabbit v2.2.1 and Oracle as the repository. Now our requirements are expanding: we would like to use the document metadata feature to store contextual info about a document so that we can use the metadata to retrieve a selected set of documents.
As the first step, I added the default SearchIndex section in workspace.xml file and restarted the JCR.
I saw a bunch of lines like this in my log file - then I saw that it created the index folder under the workspace area.
2011-07-05 15:04:01.724 INFO [WebContainer : 0] MultiIndex.java:1204 indexing... /vfs:metaData/21ee130e-978e-415f-bfd1-7aa03d91608c/vfs:attributes (3500)
I have a folder structure like this. When I create a document in JCR, I specify the metadata info as part of the document, which is a complex XSD type with tags like docType, uploadedBy, contextValue, etc.
/ (root)
  /MyApp (sub-folder)
    /documents/ (sub-folder)
      /document-1.pdf (file)
      /document-2.pdf (file)
    /accounts/ (sub-folder)
      /account.txt (file)
    etc...
The following XPath expression works.
//jcr:root/vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
If I give a wrong value, for example 'TAX' instead of 'TAX_DOCS', it returns no documents, as expected, which is great. This proves that the metadata is correctly stored and is being used in the filter process correctly.
The problem with this query is that it starts searching from the root folder but I want to search from /MyApp/documents sub-folder only. So I tried this:
//jcr:root/MyApp/documents//vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
It returns nothing. Then I tried this too but no success.
//jcr:root/MyApp/documents//*[vfs:metaData/vfs:attributes/vfs:docType='TAX_DOCS']
So what am I doing wrong? Is anything in workspace.xml configuration that we need to set or missing?
Any help is appreciated.
Thanks, Jack
Drop the double slashes from everything but the last path component and use the @ notation for the attribute value, resulting in:
/jcr:root/MyApp/documents//*[vfs:attributes/@vfs:docType='TAX_DOCS']
The // construct looks for the whole subtree instead of just the immediate children like / does. The JCR specification only requires implementations to support the // construct as the last step of the XPath query.
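For completeness, here is a small sketch of running that corrected XPath through the standard JCR query API from an already-opened session; the class and method names are just placeholders.

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

public class MetadataQuery {
    // Prints the paths of all nodes under /MyApp/documents tagged as TAX_DOCS.
    static void findTaxDocs(Session session) throws RepositoryException {
        String xpath = "/jcr:root/MyApp/documents//*[vfs:attributes/@vfs:docType='TAX_DOCS']";
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(xpath, Query.XPATH);   // XPath is deprecated in JCR 2.0 but still supported by Jackrabbit 2.x
        QueryResult result = query.execute();
        for (NodeIterator it = result.getNodes(); it.hasNext(); ) {
            Node node = it.nextNode();
            System.out.println(node.getPath());
        }
    }
}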
