Apache Solr and Lucene PDF search

I want to search for text in PDF, DOC, and JPG files.
I have an idea of how to use Lucene to extract data from PDF and DOC files and build an index on the full content.
I know Solr and Lucene have a highlighting feature, but I am wondering whether Solr or Lucene can highlight the matched results in the PDF or DOC itself and display that to the user.
Does Solr or Lucene have this functionality?

Related

Show extracted content from tika in Frontend

I work with TYPO3 10.4.18, solr_file_indexer 2.3.1 and tika 6.0.0.
For Tika, I use the Solr server as the host.
The indexing of pages, extensions, and documents works flawlessly. The search index contains the content of the documents.
Now I want to display the search results for documents in the same way as the page result list, but I can't find a variable that contains the content extracted by Tika and that can be used in the document.html file of the solr extension.
Is any additional configuration needed here?
With help from #swilking I found a simple answer to my question:
In the file document.html of the solr extension, write something like
<f:if condition="{document.type} == 'sys_file_metadata'">
<div>{s:document.highlightResult(resultSet:resultSet, document:document,
fieldName:'content')}</div>
</f:if>
to get the file content with the highlight feature.
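For anyone querying Solr directly instead of going through the TYPO3 view helper, the highlighting feature mentioned above can also be requested with Solr's standard hl parameters. Solr then returns highlighted text snippets from the stored field, not a marked-up copy of the original file. Below is a minimal Python sketch; the base URL, the core name, and the field name content are assumptions for illustration, and the request is only built here, not sent.

```python
import json
from urllib.parse import urlencode


def build_highlight_query(base_url, core, text, field="content"):
    """Build a Solr select URL that asks for highlighted snippets.

    base_url, core and field are placeholders; adjust them to your setup.
    The field must be stored in Solr for highlighting to return snippets.
    """
    # Escape backslashes and quotes so the text is searched as one phrase.
    phrase = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'{field}:"{phrase}"',
        "hl": "true",        # switch the highlighting component on
        "hl.fl": field,      # field(s) to generate snippets from
        "hl.snippets": 3,    # maximum number of snippets per field
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


def snippets_by_document(response_json, field="content"):
    """Map each document id to its highlighted snippets for one field."""
    data = json.loads(response_json)
    highlighting = data.get("highlighting", {})
    return {doc_id: fields.get(field, []) for doc_id, fields in highlighting.items()}
```

The "highlighting" section of the JSON response is keyed by document id, so the snippets can be matched to the entries in the normal result list before rendering them.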

Full Text search of content of a pdf file stored in database MVC

I am working with an ASP.NET MVC project running on .NET Framework 4.7.2. I have a database containing PDF documents and a related description column. I want to perform a full-text search so that both the text in the description column and the text of the PDF documents stored in the database are searched for a given string, and then return a list of all matching items (documents and any other matching data).
I searched for this, but it left me more confused about indexing and paging in search, and about how to query the database to retrieve the list of documents whose content matches.

Can I get the original PDF files with Solr in the UI after indexing?

I want to build an enterprise search engine with Solr. I am indexing some PDF and DOC files into Solr, and I am creating the UI with SolrJ. Can I get the original PDF files from Solr in the UI?
Solr does not keep the original files, so there is no way to get the original PDF back from Solr itself.
But you can store the path of the file in the index and provide a link in the user interface so the file can be downloaded.
Use the path returned by Solr to fetch the file from that location.
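As a rough illustration of the idea above, the following Python sketch turns Solr result documents into download links. The field name file_path and the URL prefix /files/ are hypothetical; they stand for whatever stored field and download route your application actually uses.

```python
from urllib.parse import quote


def download_links(docs, path_field="file_path", base="/files/"):
    """Build download links from Solr result documents.

    path_field and base are assumptions for illustration: path_field is a
    stored field holding the file location, base is the application's
    download route.
    """
    links = []
    for doc in docs:
        path = doc.get(path_field)
        if not path:
            continue  # document was indexed without a stored path
        # Percent-encode spaces and special characters, keep "/" separators.
        href = base + quote(path.lstrip("/"))
        links.append({"id": doc.get("id"), "href": href})
    return links
```

The UI then only needs to render each href as a link; the file itself is served from the file system or a file server, not from Solr.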

How to search through uploaded documents, Asp.net mvc?

I am making a website where users can upload documents and use a search function to find them. My question is: how do I add search functionality that searches not only the title of a document but also the document itself?
Ex.
Title: Reaction to The Pearl
Text: {Whole Document}
If we search for 'Kino' (which appears in {Whole Document}), this document should show up as a search result.
Edit:
Currently the files are uploaded to a folder on the system, and the database contains just a title and a link to each file. I have not implemented the search functionality yet.
Also, I am using ASP.NET MVC and SQL Server, if that matters.
You could use Lucene.Net to implement the search functionality (available for download from NuGet). You just need to add the documents and fields to a search index and then execute the search through the API.
I find this tutorial for Lucene.Net a useful example.
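To make the "add documents and fields to an index, then search" idea concrete, here is a toy inverted index in plain Python. It is not Lucene.Net and does not use its API; the class TinyIndex and its methods are made up for illustration and skip everything a real engine adds (analysis, scoring, persistence).

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each token to the ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokenize(text):
        # Lowercase and split on non-word characters.
        return re.findall(r"\w+", text.lower())

    def add(self, doc_id, title, text):
        """Index both the title and the full text under the same document id."""
        for token in self._tokenize(title + " " + text):
            self._postings[token].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents that contain every query token."""
        tokens = self._tokenize(query)
        if not tokens:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(matches)
```

For example, after index.add("doc-1", "Book report", "Kino finds a pearl"), a call to index.search("kino") returns ["doc-1"] even though the word only occurs in the document body. With a real library, the text of each uploaded file would first have to be extracted before it can be added as a field.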

Posting wget html pages to Solr

How can I post HTML web pages to a Solr index when downloading them with wget? How could I modify the following example so that the page is indexed at the same time it is downloaded?
wget -P /var/myserver/archive http://www.somesite/products.html
I can't spot an obvious example in the Solr documentation and would be grateful for any pointers.
You can check out Apache Nutch, which is an open source web crawler.
You give Nutch a base page, and it will index that page as well as the pages it links to.
Nutch integrates with Solr, so the pages are indexed by Solr and become searchable.
However, if it is just a couple of pages and you do not need spidering capabilities, you can simply download the HTML pages and feed them to Solr through client code.
Solr has HTML filters that help extract the content from these pages and index it as text.
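A minimal sketch of the "client code" route in Python, using only the standard library: it strips the markup on the client side and prepares a JSON update request for Solr. The core name, the field name content, and the helper names are assumptions for illustration, and the schema must define the fields used. The request is only constructed here; passing it to urllib.request.urlopen would actually send it.

```python
import json
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_solr_doc(doc_id, html):
    """Turn a downloaded HTML page into a flat document for indexing."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return {"id": doc_id, "content": parser.text()}


def build_update_request(base_url, core, docs):
    """Prepare (but do not send) a JSON update request for a Solr core."""
    url = f"{base_url}/solr/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

After wget has saved a page, a small script could read the file, call html_to_solr_doc with the URL as the id, and send the resulting request, which keeps downloading and indexing in one step.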
