Show extracted content from Tika in the frontend - Solr

I work with TYPO3 10.4.18, solr_file_indexer 2.3.1 and tika 6.0.0.
For Tika, the Solr server is configured as the host.
The indexing of pages, extensions and documents works flawlessly. The search index contains the content of the documents.
Now I want to display the search results for the documents like the page result list. But I can't find a variable that contains the content extracted by Tika and can be used in the document.html template of EXT:solr.
Is there any additional configuration needed here?

With help from #swilking I found the simple answer to my question:
In the file document.html of EXT:solr, write something like

<f:if condition="{document.type} == 'sys_file_metadata'">
    <div>{s:document.highlightResult(resultSet: resultSet, document: document, fieldName: 'content')}</div>
</f:if>

to get the file content with the highlighting feature (s: being the solr extension's own view-helper namespace).

Related

Can I get the original PDF files with Solr in the UI after indexing?

I want to build an enterprise search engine with Solr. I am indexing some PDF and DOC files into Solr, and I am building the UI with SolrJ. Can I get the original PDF files with Solr in the UI?
Solr does not store the original files, so there is no way to get the original PDF file back from Solr.
But you can store the path of the file in the index and provide a download link in the user interface.
Your application can then use the path returned by Solr to fetch the file from that location.
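A minimal SolrJ sketch of that pattern (the core URL and the filePath field name here are assumptions, not part of any fixed schema):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class PathFieldExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            // Index the extracted text for searching and the file path for later retrieval.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("content", "text previously extracted from the PDF ...");
            doc.addField("filePath", "/var/files/report.pdf"); // assumed stored field
            solr.add(doc);
            solr.commit();

            // At query time, read the path back and turn it into a download link in the UI.
            QueryResponse rsp = solr.query(new SolrQuery("content:extracted"));
            for (SolrDocument hit : rsp.getResults()) {
                System.out.println("Download link target: " + hit.getFieldValue("filePath"));
            }
        }
    }
}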

Apache Solr and Lucene PDF search

I want to search for some text in PDF, DOC, and JPG files.
I have an idea of how Lucene can extract data from PDF and DOC files and create an index on the full content.
I know Solr and Lucene have a highlighting feature, but I am wondering whether Solr or Lucene can highlight matched results in the PDF or DOC file itself and display that to the user.
Does Solr or Lucene have this functionality?
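For context, Lucene's highlighter operates on the plain text extracted from the document, not on the original binary file; rendering marks inside the PDF or DOC itself would have to happen in a viewer, outside Solr/Lucene. A minimal sketch using the lucene-highlighter module (field name and sample text are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class HighlightSketch {
    public static void main(String[] args) throws Exception {
        // Text previously extracted from the PDF/DOC (e.g. by Tika) and stored in the index.
        String extractedText = "Lucene highlights matches in the extracted text, not in the PDF itself.";

        Query query = new TermQuery(new Term("content", "lucene"));
        Highlighter highlighter = new Highlighter(
                new SimpleHTMLFormatter("<em>", "</em>"), new QueryScorer(query));

        // The highlighter wraps matches in the plain text for HTML display.
        String fragment = highlighter.getBestFragment(
                new StandardAnalyzer(), "content", extractedText);
        System.out.println(fragment); // <em>Lucene</em> highlights matches ...
    }
}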

How to search through uploaded documents, ASP.NET MVC?

I am making a website where users can upload documents, and I also want a search functionality to find those documents. My question is: how do I add a search functionality that searches not only the title of the document but also the document itself?
Ex.
Title: Reaction to The Pearl
Text: {Whole Document}
If we search for 'Kino' (which appears in {Whole Document}), this document should show up as a result of the search.
Edit:
Currently I have the documents uploaded to a folder on the system, and the database just contains a title and a link to the file. I have not implemented the search functionality yet.
Also, I am using ASP.NET MVC and SQL Server, if that matters.
You could use Lucene.Net to implement the search functionality (available for download from NuGet). You just need to add the documents and their fields to a search index and then execute the search through the API.
I found this Lucene.Net tutorial a useful example.
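A sketch of that index-and-search flow, written against Java Lucene (Lucene.Net mirrors this API closely); the field names, sample path and in-memory directory are illustrative:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class UploadSearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index for the sketch
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index both the title and the full extracted text of the uploaded document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("path", "/uploads/reaction.docx", Field.Store.YES));
            doc.add(new TextField("title", "Reaction to The Pearl", Field.Store.YES));
            doc.add(new TextField("body", "... Kino dives for pearls ...", Field.Store.NO));
            writer.addDocument(doc);
        }

        // A search on the body finds the document even though 'Kino' is not in the title.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("Kino"), 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("title"));
            }
        }
    }
}

In the MVC application you would update the index on each upload, extracting the document's text before adding the body field.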

Apache Nutch Crawl Dynamic Products

Currently we are using Apache Solr as the search engine and Apache Nutch as the crawler. Now we have created a site which contains products that are generated dynamically.
The current setup searches within the content field, so whenever we search for a dynamic product, it does not come up in the search results.
Can you please guide me on how to crawl and index dynamic products on a page into Apache Solr? Can we do this using sitemap.xml? If yes, then please suggest how.
Thanks!
One possible solution is this:
Step 1) Put the description of each dynamic product on its own page, e.g. http://domain/product?id=xxx (or with a friendlier URL such as http://domain/product-x).
Step 2) You need a page or several pages that list the URLs of these products. The sitemap.xml you mentioned is one choice, but a simple HTML page also suffices. So, for instance, you can dynamically generate a page named products_list which contains a plain link to each product page, as in the sketch below.
Step 3) You should either add the URL of the products_list page to your Nutch seed file or include a link to it on one of the pages already being crawled.
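A hypothetical sketch of step 2, generating such a products_list page as plain HTML links that Nutch can follow (the URL pattern and product ids are assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ProductsListGenerator {
    public static void main(String[] args) throws IOException {
        // Assumed: the product ids come from your catalog or database.
        List<String> productIds = List.of("101", "102", "103");

        StringBuilder html = new StringBuilder("<html><body>\n");
        for (String id : productIds) {
            // One plain link per product page; Nutch follows these during the crawl.
            html.append("<a href=\"http://domain/product?id=").append(id)
                .append("\">Product ").append(id).append("</a>\n");
        }
        html.append("</body></html>\n");

        Files.writeString(Path.of("products_list.html"), html.toString());
    }
}

Publish the generated page under your domain and point the Nutch seed file (or an existing crawled page) at its URL, as described in step 3.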

Posting wget html pages to Solr

How can I post HTML web pages to a Solr index while downloading them with wget? How could I modify the following example so that the page gets indexed at the same time?
wget -P /var/myserver/archive http://www.somesite/products.html
I can't spot an obvious example in the Solr documentation and would be grateful for any pointers.
You can check out Apache Nutch, which is an open-source web crawler.
You can provide Nutch with a base page and it will index that page as well as the links in it.
Nutch integrates with Solr, so the pages will be indexed by Solr and be searchable.
However, if it's just a couple of pages with no spidering needed, you can simply download the HTML pages and feed them to Solr through client code.
Solr has HTML filters which will help extract the content from these pages and index it as text.
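As a sketch of the client-code route via SolrJ, assuming the core has Solr's extracting request handler enabled (/update/extract, backed by Tika); the core name, file path and document id are assumptions:

import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PostHtmlToSolr {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/pages").build()) {
            // Send the wget-downloaded file to the extracting handler, which strips the markup.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("/var/myserver/archive/products.html"), "text/html");
            req.setParam("literal.id", "products.html"); // assumed unique key
            req.setParam("commit", "true");
            solr.request(req);
        }
    }
}

You could invoke this right after the wget call so that downloading and indexing happen in one step.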
