Apache Solr: How to access and index files from another server

Solr version :: 6.6.1
I am new to Apache Solr and am currently exploring how to use it to search within PDF files.
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor
I am able to index PDF files that reside on the same server using BinFileDataSource, as shown in the example below.
Now I want to know whether there is a way to change baseDir to point to a folder on a different server.
Please suggest an example of accessing PDF files from another server. How should the path be written in the baseDir attribute?
<dataConfig>
  <dataSource type="BinFileDataSource"/> <!--Local filesystem-->
  <document>
    <entity name="K2FileEntity" processor="FileListEntityProcessor" dataSource="null"
            recursive="true"
            baseDir="C:/solr-6.6.1/server/solr/core_K2_Depot/Depot" fileName=".*pdf" rootEntity="false">
      <field column="file" name="id"/>
      <field column="fileLastModified" name="lastmodified"/>
      <entity name="pdf" processor="TikaEntityProcessor" onError="skip"
              url="${K2FileEntity.fileAbsolutePath}" format="text">
        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

I finally found the answer from the solr-user mailing list.
Just change baseDir to the folder on the other server (SMB/UNC paths work directly):
baseDir="\\CLDServer2\RemoteK2Depot"
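For reference, the entity from the config above would then look like this; a minimal sketch, assuming the account running Solr has read access to the share:

<entity name="K2FileEntity" processor="FileListEntityProcessor" dataSource="null"
        recursive="true"
        baseDir="\\CLDServer2\RemoteK2Depot" fileName=".*pdf" rootEntity="false">
  <!-- field mappings and the TikaEntityProcessor sub-entity stay exactly as before -->
</entity>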

Related

Solr: How to clear the baseDir folder after the DIH import

Solr Version :: 6.6.1
I am able to import PDF files into Solr using the DIH, and the indexing performs as expected. But I wish to clear the folder C:/solr-6.6.1/server/solr/core_K2_Depot/Depot after the indexing process finishes successfully.
Please suggest whether there is a way to delete all the files from the folder via the DIH data-config.xml, or an easier alternative.
<!--Local filesystem-->
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="K2FileEntity" processor="FileListEntityProcessor" dataSource="null"
            recursive="true"
            baseDir="C:/solr-6.6.1/server/solr/core_K2_Depot/Depot" fileName=".*pdf" rootEntity="false">
      <field column="file" name="id"/>
      <field column="fileLastModified" name="lastmodified"/>
      <entity name="pdf" processor="TikaEntityProcessor" onError="skip"
              url="${K2FileEntity.fileAbsolutePath}" format="text">
        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Usually, in production you run the DIH process via shell scripts: first copy the needed files from FTP, HTTP, S3, etc., then trigger a full-import or delta-import, and then poll the status command until the import finishes successfully, at which point you execute an rm command. A rough sketch (host, port, and core name are assumptions):
while true; do
  # poll the DIH status endpoint for this core
  status=$(curl -s "http://localhost:8983/solr/core_K2_Depot/dataimport?command=status&wt=json")
  echo "$status" | grep -q '"status":"idle"' && break   # "idle" means the import has ended
  sleep 10
done
rm -rf /path/to/staging/files   # remove the files that are no longer needed for indexing
Note that Solr itself has no support for deleting external files.

SOLR - TikaEntityProcessor - BLOB Import

I'm having issues figuring out exactly how to import BLOB data from a SQL Server database into Solr.
This is hooked into NAV as well. I've managed to get the data out of the table within NAV, however I need this data in SOLR for search purposes.
Here's my current dataConfig file.
<dataConfig>
  <dataSource name="dastream" type="FieldStreamDataSource" />
  <dataSource name="db" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://localhost;databaseName=TestingDB" user="sa" password="*******" />
  <document name="items">
    <entity name="item" query="select [No_], [Desc_ English] as desceng from [Foo$Item]" dataSource="db">
      <field column="No_" name="id" />
      <entity processor="TikaEntityProcessor" url="desceng" dataField="item.desceng" name="blob" dataSource="dastream" format="text">
        <field column="text" name="desceng" />
      </entity>
    </entity>
  </document>
</dataConfig>
The error I keep getting is:
Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: unsupported type : class java.lang.String
I'm not sure what I'm missing.
Maybe this is because NAV stores BLOBs in its own way. See this question; there is an example of how to extract the data using Python.

Hi, I want the file name using FileListEntityProcessor and LineEntityProcessor

This is my data-config.xml. I can't use TikaEntityProcessor. Is there any way I can do it with LineEntityProcessor?
I am using Solr 4.4 to index millions of documents. I want the file names and modified times to be indexed as well, but could not find a way to do it.
In data-config.xml I fetch files using FileListEntityProcessor and then parse each line using LineEntityProcessor.
<dataConfig>
  <dataSource encoding="UTF-8" type="FileDataSource" name="fds" />
  <document>
    <entity name="files"
            dataSource="null"
            rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="C:/Softwares/PlafFiles/"
            fileName=".*\.PLF"
            recursive="true">
      <field column="fileLastModified" name="last_modified" />
      <entity name="na_04"
              processor="LineEntityProcessor"
              dataSource="fds"
              url="${files.fileAbsolutePath}"
              transformer="script:parseRow23">
        <field column="url" name="Plaf_filename"/>
        <field column="source" />
        <field column="pict_id" name="pict_id" />
        <field column="pict_type" name="pict_type" />
        <field column="hierarchy_id" name="hierarchy_id" />
        <field column="book_id" name="book_id" />
        <field column="ciscode" name="ciscode" />
        <field column="plaf_line" />
      </entity>
    </entity>
  </document>
</dataConfig>
From the documentation of FileListEntityProcessor:
The implicit fields generated by the FileListEntityProcessor are fileDir, file, fileAbsolutePath, fileSize, fileLastModified and these are available for use within the entity [..].
You can move these values into differently named fields by referencing them:
<field column="file" name="filenamefield" />
<field column="fileLastModified" name="last_modified" />
This will require that you have a schema.xml that actually allows those two names.
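For example, a minimal sketch of the matching schema.xml declarations (the field types here are assumptions; use whatever types your schema already defines):

<field name="filenamefield" type="string" indexed="true" stored="true" />
<field name="last_modified" type="date" indexed="true" stored="true" />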
If you need to use them in another string or manipulate them further before inserting:
You're already using files.fileAbsolutePath, so by using ${files.file} and ${files.fileLastModified} you should be able to extract the values you want.
You can modify these values and insert them into a specific field by using the TemplateTransformer and referencing the generated fields:
<field column="filename" template="file:///${files.file}" />

Getting an exception while reading file content using the Solr LineEntityProcessor

I need to search inside file contents, for which I am using the Solr Data Import Handler. The response should show the line where the search word appears, so I am using LineEntityProcessor to process files line by line. My data-config file is:
<dataConfig>
  <dataSource type="BinFileDataSource" name="fds"/>
  <document>
    <entity name="filelist" processor="FileListEntityProcessor" fileName="sample.docx"
            rootEntity="false" baseDir="C:\SampleDocuments">
      <entity name="fileline" processor="LineEntityProcessor"
              url="${filelist.fileAbsolutePath}" format="text">
        <field column="linecontent" name="rawLine"/>
      </entity>
    </entity>
  </document>
</dataConfig>
The schema.xml has an entry for rawLine:
<field name="rawLine" type="text" indexed="true" stored="true"/>
But when I run the full-import command, it throws an exception:
DataImportHandlerException:java.lang.ClassCastException: java.io.FileInputStream cannot be cast to java.io.Reader
Please help me with this, as I have spent a few days on this problem.
BinFileDataSource works with an InputStream, while FileDataSource supplies a Reader, which is what LineEntityProcessor expects. Try using FileDataSource instead to resolve the casting issue:
<dataSource type="FileDataSource" name = "fds"/>

Unsupported type exception when importing documents from a database with Solr 4.0

I looked up the information provided in a related question to set up an import of all documents stored in a MySQL database.
You can find the original question here.
Thanks to the steps provided, I was able to make it work with the MySQL DB. My config looks identical to the one mentioned at the above link.
<dataConfig>
  <dataSource name="db"
              jndiName="java:jboss/datasources/somename"
              type="JdbcDataSource"
              convertType="false" />
  <dataSource name="dastream" type="FieldStreamDataSource" />
  <dataSource name="dareader" type="FieldReaderDataSource" />
  <document name="docs">
    <entity name="doc" query="select * from document" dataSource="db">
      <field name="id" column="id" />
      <field name="name" column="descShort" />
      <entity name="comment"
              transformer="HTMLStripTransformer" dataSource="db"
              query="select id, body, subject from comment where iddoc='${doc.id}'">
        <field name="idComm" column="id" />
        <field name="detail" column="body" stripHTML="true" />
        <field name="subject" column="subject" />
      </entity>
      <entity name="attachments"
              query="select id, attName, attContent, attContentType from Attachment where iddoc='${doc.id}'"
              dataSource="db">
        <field name="attachment_name" column="attName" />
        <field name="idAttachment" column="id" />
        <field name="attContentType" column="attContentType" />
        <entity name="attachment"
                dataSource="dastream"
                processor="TikaEntityProcessor"
                url="attContent"
                dataField="attachments.attContent"
                format="text"
                onError="continue">
          <field column="text" name="attachment_detail" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
I have a variety of attachments in the DB, such as JPEG, PDF, Excel, Word, and plain text. Everything works great for most of the binary data (JPEG, PDF, Word, and so on), but the import fails for certain files. It appears that the data source is set up to throw an exception when it encounters a String instead of an InputStream. I set onError="continue" on the "attachment" entity to ensure the import runs through despite this error, and noticed that the problem happens for a number of files. The exception is given below. Any ideas?
Exception in entity : attachment:java.lang.RuntimeException: unsupported type : class java.lang.String
at org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:89)
at org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:48)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:103)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)
I know this is an outdated question, but it appears to me that this exception is thrown when the BLOB is null (I work with Oracle). When I add a where clause like "blob_column is not null", the problem disappears for me (Solr 4.10.1). Applied to this question's config, that would look like the sketch below.
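A sketch of the filter applied to the attachments entity (column names taken from the original config; adjust to your schema):

<entity name="attachments"
        dataSource="db"
        query="select id, attName, attContent, attContentType
               from Attachment
               where iddoc='${doc.id}' and attContent is not null">
  <!-- fields and the Tika sub-entity stay as in the original config -->
</entity>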
