Searching PDF files stored in database using SOLR - solr

I have a lot of PDF files stored in a database (MSSQL) that I need to search. They are stored as BLOBs. I need a walkthrough on how to search them using SOLR.
I have a DB, let's call it "fred". Inside fred is a table, we'll call it pdffiles. pdffiles has a column named pdfdata, of type BLOB.
The pdfs are stored in this table, with the binary data stored in the column. What steps do I take to get SOLR to extract this data and index it?
I'm guessing it involves the TikaEntityProcessor but having the pdfs stored in the database rather than just being regular files adds a level of complexity. I have previously worked with SOLR and have it running in production.
Sample dataconfig and schema files would be very useful.

What steps do I take to get SOLR to extract this data and index it?
1. Create a new file called tika-data-config.xml that holds the database configuration and the query that fetches the data.
2. Edit solrconfig.xml in a text editor and register the data import handler (pointing at tika-data-config.xml) within the config tags.
3. Reference the libs needed by the data import handler (typically via lib entries in solrconfig.xml).
4. Provide the JDBC driver jar file for your database.
5. Update schema.xml to declare your fields, choosing a fieldType for each one that suits your search requirements.
6. Once the setup is ready, ask Solr to index using
http://localhost:8983/solr/collection1/dataimport?command=full-import
For more detail, refer to the Solr documentation: Configure DIH
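As a rough illustration of step 1, the sketch below generates a tika-data-config.xml for the "fred" / pdffiles / pdfdata layout from the question. It is not a tested production config. It assumes the table has an id column, that the schema has id and content fields, and that the Microsoft JDBC driver is on the classpath. The connection string, user name and password are placeholders. The idea is that a FieldStreamDataSource hands the BLOB column of each row to a nested TikaEntityProcessor entity.

```python
import xml.etree.ElementTree as ET


def build_tika_data_config(jdbc_url, user, password):
    """Return DIH config text that reads BLOB rows and parses them with Tika."""
    root = ET.Element("dataConfig")
    # JDBC source for the table rows (assumed: Microsoft SQL Server JDBC driver).
    ET.SubElement(root, "dataSource", {
        "name": "db",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "url": jdbc_url,
        "user": user,
        "password": password,
    })
    # FieldStreamDataSource exposes a binary column of the parent row as a stream.
    ET.SubElement(root, "dataSource", {"name": "blob", "type": "FieldStreamDataSource"})
    document = ET.SubElement(root, "document")
    rows = ET.SubElement(document, "entity", {
        "name": "pdffiles",
        "dataSource": "db",
        # The id column is an assumption; the question only names pdfdata.
        "query": "SELECT id, pdfdata FROM pdffiles",
    })
    ET.SubElement(rows, "field", {"column": "id", "name": "id"})
    # Nested entity: Tika extracts text from the BLOB referenced by dataField.
    tika = ET.SubElement(rows, "entity", {
        "name": "pdf",
        "processor": "TikaEntityProcessor",
        "dataSource": "blob",
        "dataField": "pdffiles.pdfdata",
        "format": "text",
    })
    # "content" is a hypothetical schema field for the extracted text.
    ET.SubElement(tika, "field", {"column": "text", "name": "content"})
    return ET.tostring(root, encoding="unicode")


config_xml = build_tika_data_config(
    "jdbc:sqlserver://localhost:1433;databaseName=fred", "solr_reader", "change-me"
)
print(config_xml)
```

Write the printed text to tika-data-config.xml in the core's conf directory, then trigger the full-import URL shown above.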

Related

Highlighting Solr search results with bin/post and managed schema

I've got Solr 6.6.1 installed. I run bin/post to fetch and index some documents into a new core. I'd like to add a text field and highlight on that field. I notice that in server/solr/myCore/conf that there is a file, managed-schema, with a warning that tells me not to edit the file.
What's the supported way to use bin/post AND enable highlighting on a text field?
Solr implicitly uses a ManagedIndexSchemaFactory, which is by default "mutable" and keeps schema information in a managed-schema file.
You have two choices:
1. Go back to <schemaFactory class="ClassicIndexSchemaFactory"/>, so that you can edit the schema file manually.
2. Stay with the managed schema and use the Schema API's modification operations over HTTP to add the new field that you will use for highlighting.
I would recommend sticking with #2, but it's totally up to you. The official documentation will help you choose the schema options your text fields need to get the best out of highlighting.
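A minimal sketch of option 2, assuming a core named myCore on localhost:8983, a hypothetical field name body_text, and the text_general field type from the default configset. It only builds the Schema API request body and a highlighting query string. Sending them is left to whatever HTTP client you prefer.

```python
import json
from urllib.parse import urlencode


def add_field_command(name, field_type="text_general"):
    """Build the JSON body for a Schema API add-field command."""
    return json.dumps({
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            # Highlighting needs the original text, so the field is stored.
            "stored": True,
        }
    })


def highlight_query(text, field):
    """Build select parameters that search a field and highlight matches in it."""
    return urlencode({"q": f"{field}:{text}", "hl": "true", "hl.fl": field})


body = add_field_command("body_text")
params = highlight_query("solr", "body_text")
print("POST http://localhost:8983/solr/myCore/schema", body)
print("GET  http://localhost:8983/solr/myCore/select?" + params)
```

Documents indexed before the field was added have to be re-posted before they can show highlights for it.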

How to download indexed document back from Solr?

I am able to index a document (Word, PDF) using Solr. Is it possible to get the original document back? I assume not, because Solr stores only an index, but please correct me if I am wrong.
If not, how is this typically resolved (I mean retrieving the original docs)? By storing them in separate storage?
@Alec
Your understanding is correct.
You can't get back the original documents. Your alternative is to store the original documents separately, generate a unique ID in your main data store, and attach that ID to the document you send to SOLR so that search results can be linked back to the original.
In fact SOLR is designed for speed of search and is not as transaction-friendly as an RDBMS.
So in my projects I use this strategy of maintaining a separate datastore as the authoritative source of all application data (not just docs).
For a bit about the internals of document handling, I suggest you look at the example on the Solr Wiki: https://wiki.apache.org/solr/ExtractingRequestHandler.
Later versions are documented here:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
Docs say Solr's ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.
This means that only the extracted text is actually stored in SOLR. The raw binary content is not really of use to SOLR for search / indexing purposes (and is presumably discarded although I haven't found exact text saying they discard the raw binary content of docs thus extracted).
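The linking strategy described above can be sketched as follows. DocumentStore is a hypothetical in-memory stand-in for the authoritative store (a database or file system in practice), and the id and content field names are assumptions about the Solr schema.

```python
import uuid


class DocumentStore:
    """Hypothetical authoritative store that keeps the original bytes by ID."""

    def __init__(self):
        self._blobs = {}

    def save(self, raw_bytes):
        # Generate the unique ID in the main data store, not in Solr.
        doc_id = str(uuid.uuid4())
        self._blobs[doc_id] = raw_bytes
        return doc_id

    def load(self, doc_id):
        return self._blobs[doc_id]


def to_solr_document(doc_id, extracted_text):
    """Shape of the document sent to Solr: the shared ID plus the extracted text."""
    return {"id": doc_id, "content": extracted_text}


store = DocumentStore()
original = b"%PDF-1.4 example bytes"
doc_id = store.save(original)
solr_doc = to_solr_document(doc_id, "text extracted by Tika")

# A search hit carries the ID, which leads back to the original file.
restored = store.load(solr_doc["id"])
print(restored == original)
```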

Index content of PDFs with Solr and Tika

The problem briefly: I would like Sitecore to index the contents of PDFs using Solr's built in functionality (supplied by Tika). I'm not sure how to configure Sitecore's indexing to use this feature in Solr(Tika). (I think I need to write a custom indexer.)
I'm working with Sitecore 7 (7.1 Update 1) and want to index content from PDFs (or other rich media types). I'd like to index this data for search purposes.
I have Solr (4.6.1) installed and working with Sitecore 7. When I index my site it saves all of the documents to the correct Solr core, and I can successfully retrieve these documents for display.
Using curl, I can send a PDF to my Solr instance and get it indexed.
curl "http://localhost:8983/solr/update/extract?literal._id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@sample.pdf"
This works, and I can read this content in my Sitecore web project and display it in views, so I know I can get access to this data. However, I would like the data to be attached to the items that I have uploaded in Sitecore.
I'd like something like this to happen when I upload a PDF to the Sitecore Media Library and publish the item, or at least when I re-index the site.
I'm currently walking through the following tutorial to learn some things about writing custom indexing (here is a link to part 1):
http://www.sitecore.net/Community/Technical-Blogs/Getting-to-Know-Sitecore/Posts/2013/04/Sitecore-7-Search-Provider-Part-1-Manually-Triggered-Indexing.aspx
Thanks for your patience.
For Sitecore, when handling media data, Lucene and Solr needed to index the content in a consistent way (so that you could switch between them if needed and still get data indexed in the same way). As Tika integration is very much a Solr thing, it was decided that both should use the general Windows concept of IFilters for indexing (http://en.wikipedia.org/wiki/IFilter).
This means that as long as the correct IFilter for that MIME type is installed on the machine doing the indexing, the '_content' computed field will be populated with the output.
This doesn't mean you can't use the Solr Tika integration but it is not supported by default and would be a customization.
It would be very simple to:
Disable the '_content' computed field
Set up a publish pipeline processor that looks at each item being published
Check if it is a media item
Check to see if it is a PDF
Issue a command to push the content to your Solr server for indexing by Tika.
You may want to see what results you get with an IFilter first. If they are close enough to what you want, go with that. If Tika produces better results for you, you should be able to switch to it, although you would probably have your media content indexed in a separate Solr core and so lose any Sitecore-specific metadata around the document.
Some blog posts that might be helpful:
http://www.samjgriffin.com/blog/2013/11/06/sitecore-7-pdf-and-document-content-search/
http://www.sitecore.net/Community/Technical-Blogs/John-West-Sitecore-Blog/Posts/2013/04/Sitecore-7-Indexing-Media-with-IFilters.aspx
I second the recommendation to use Sitecore's built-in MediaItemContentExtractor + IFilter approach, unless you've already ruled that out for some reason (IFilter difficulties, perhaps). If an IFilter is not an option, or if you're interested in the other approach regardless, I would integrate Tika a little differently than Stephen suggested, though.
The tutorial you referenced deals with writing your own search provider - i.e. replacing the built-in provider entirely. You should be able to leverage Sitecore's Solr provider and accomplish what you need with something lighter: a computed index field. The built-in media extractor mentioned above uses this approach, which enables you to put just about anything into your index during the normal indexing process. Here is a blog post from John West that walks through creating a basic computed index field: Sitecore 7: Computed Index Fields.
In short, write a class that implements IComputedIndexField and represents content extracted from PDFs or other rich documents. In your implementation of the ComputeFieldValue method:
Call GetMediaStream() on the document.
Pass the stream to Solr in an extract-only command and capture the result.
Return that result to store it in the computed index field.
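For orientation, the extract-only call in the second step targets the same /update/extract handler used by the curl command in the question, with extractOnly=true so that Solr returns the extracted text instead of indexing it. The sketch below only builds that URL, in Python rather than the C# a computed field would use, and assumes the base URL from the question. The media stream would be posted to it as the request body.

```python
from urllib.parse import urlencode


def extract_only_url(solr_base="http://localhost:8983/solr"):
    """Build a Solr Cell URL that returns extracted text without indexing it."""
    params = {
        "extractOnly": "true",    # return the extraction result, do not index
        "extractFormat": "text",  # plain text instead of the default XHTML
        "wt": "json",
    }
    return f"{solr_base}/update/extract?{urlencode(params)}"


url = extract_only_url()
print(url)
```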
If you need the media content to be in a particular existing index field, then configure the computed field as a copy field (see 3.3.3) into the existing field. Otherwise, configure your search to reference the computed field.
The main drawback here is the expense of passing the extracted content back and forth, rather than committing it directly to the index in a single step. Depending on your index size and contents, that may not be an issue for you.
One other potential option would be a post-rebuild task to add media content to the existing indexed documents. I am not certain this would work. It depends on knowing the IDs of the media items' documents and committing the rich document content in partial document updates, which this person was unsuccessful in attempting. If you try this, be sure to execute it in the indexing:end event prior to HTML cache clearing.
Whatever approach you take, if you want to work with Tika on a higher level than cURL, have a look at SolrNet's implementation of ExtractCommand and related classes.
If you can upgrade your site to Sitecore 7.2, media item content is indexed automatically and there is no need to install the related IFilter. You should read the following:
Media Content Indexing in Sitecore 7.2

lucene indexing security files

I'll try to briefly describe my problem and task.
My task is to create a search engine for different types of text-bearing files (PDF, Word, ODF, XML), but not HTML.
I have a little experience with Lucene: about a year ago I wrote a simple full-text search using Lucene and Hibernate Search. That was a simple project, but the search task I have now is much harder.
We are using Java 1.7 and GlassFish 3, and I have to concentrate only on the server side, not the client UI. I have three major problems:
1) All files are stored on a WebDAV server, but information about each file (name, ID, file type, etc.) is stored in a database (PostgreSQL), so when creating the index I need to use both sources. As the result of a query I only need to return the file ID from the database. In summary, file content lives on the server while file metadata lives in the database, so we must retrieve both.
2) The second problem is that each file has a level of secrecy, and the major difficulty is that this level is calculated dynamically. When calculating the security level for a file we consider several properties: static ones such as the file's location (the folder the file is in), but also dynamic ones such as user profiles, user roles, and departments. So when user "Maggie" is logged in she can search only the files "test.pdf", "test2.doc", etc., but when user "Stev" is logged in he has a different profile from Maggie, so he can only search for a phrase in the files "broken.pdf", "mybook.odt", "test2.doc", etc. I think that when a user searches for a phrase such as "lucene +solr", we would search all indexed documents and filter the results afterwards. But I think that solution is not very efficient. What if the result contains 100 files: do we then filter each file one by one? I do not see any other solution. Maybe Lucene or Solr has a mechanism that can help.
3) The last problem is that some files are encrypted, so those files must be indexed only once, before encryption. But I think that indexing secure files creates a security issue, because every word from the file is tokenized.
I have no idea how to secure Lucene documents and the index datastore. Is it possible?
I also have a question: do I need to use Solr for my search engine, or should I use only Lucene and write my own search engine? As you can see, my problem is not with indexing or searching, but with file security and security levels.
Thanks for any hints and for the time you spend on this.
To index both the file and its metadata from the DB, look at the ExtractingRequestHandler.
You can pass the metadata attributes and the file to be indexed in a single request, and they will be stored as a single document in the Lucene index.
For security, one option is to store the users/roles who have access to the files/documents within the Solr index.
You can then always filter the results by user/role to retrieve only the results that user is allowed to see.
Secure your Solr URL so that users don't have direct access to the documents.
Also check SOLR-1872.
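The role-based filtering suggested above could look roughly like this. The allowed_roles field name is hypothetical (a multi-valued field you would populate at index time), and the role names are made up. The filter goes into fq so Solr applies it while searching, rather than post-filtering each hit, and fl=id returns only the file ID as the question asks.

```python
from urllib.parse import urlencode


def role_filter(roles, field="allowed_roles"):
    """Build a filter query matching documents visible to any of the given roles."""
    if not roles:
        # Refuse to build an unrestricted query for a user without roles.
        raise ValueError("user has no roles; refusing to build an open query")
    quoted = " OR ".join(
        '"' + role.replace("\\", "\\\\").replace('"', '\\"') + '"' for role in roles
    )
    return f"{field}:({quoted})"


def secured_query(user_query, roles):
    """Combine the user's query with the role filter; return only document IDs."""
    return urlencode({"q": user_query, "fq": role_filter(roles), "fl": "id"})


print(role_filter(["editors", "dept_finance"]))
print(secured_query("lucene +solr", ["editors", "dept_finance"]))
```

Because the secrecy level in the question is computed dynamically, the stored roles would need to be re-indexed whenever a file's computed access changes.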
For encryption, Solr and its underlying parser, Tika, can handle encrypted files when given additional parameters.
Apache Solr uses Apache Tika, which uses the Bouncy Castle generic encryption libraries to extract text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.

Solr - not returning me the complete rows

I am new to Solr.
I have the issues below:
I am using Tomcat 6, Oracle 10g as the database, and Solr 4. When I deploy solr.war in Tomcat I get an exception in the Tomcat console saying the dataimporthandler class was not found. I have specified solr home, and a lib directory in my solr home contains all the jars.
Why does solr.war still expect the jars to be in its own lib folder?
I have indexed the entity with a full import.
I have a simple database table in Oracle with the typical columns emp_id, emp_name, emp_dept.
I have defined data-config.xml and currently have only one document and entity. I have updated schema.xml accordingly.
When I run a /select query, I get only emp_id in the XML/JSON output.
How do I specify which fields I want in the response?
I have 2222222 rows in the database but get only 10 back; only if I edit the config XML do I get the number of rows I specify. My database table can grow, so how do I get all the rows?
The problem is that I cannot say how many rows are required, which does not make sense anyway since the row count keeps increasing as transactions happen.
Thanks,
1. Check that the lib directory in Solr home contains the apache-solr-dataimporthandler-x.y.z.jar and apache-solr-dataimporthandler-extras-x.y.z.jar files. Also check that the lib directory is configured correctly in solrconfig.xml.
2. Add stored=true to the field definitions in schema.xml. Fields without stored=true are not shown in the output. To return only certain fields, use the fl=fieldName query option.
3. The rows parameter controls how many results are returned, and the response also includes a numFound field giving the total number of documents that match the query.
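Points 2 and 3 combine into a simple paging loop: request a fixed page size with rows, read numFound from the first response, and step start until all documents are fetched. The sketch below only computes the request parameters. The page size of 1000 is an arbitrary choice, and the field names come from the question.

```python
from urllib.parse import urlencode


def page_params(start, rows=1000, fields=("emp_id", "emp_name", "emp_dept")):
    """Build select parameters for one page, listing the stored fields to return."""
    return urlencode({"q": "*:*", "fl": ",".join(fields), "start": start, "rows": rows})


def page_starts(num_found, rows=1000):
    """Return the start offsets needed to walk through num_found documents."""
    return list(range(0, num_found, rows))


print(page_params(0))
# With numFound = 2500 from the first response, three requests cover everything.
print(page_starts(2500))
```

This way the client never has to know the table size in advance, because numFound in each response says how far there is left to go.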

Resources