Solr Version :: 6.6.1
I am able to import the PDF files into Solr using the DIH, and the indexing runs as expected. But I wish to clear the folder C:/solr-6.6.1/server/solr/core_K2_Depot/Depot after the indexing process finishes successfully.
Please suggest if there is a way to delete all the files from the folder via the DIH data-config.xml, or another easier way.
<!--Local filesystem-->
<dataConfig>
<dataSource type="BinFileDataSource"/>
<document>
<entity name="K2FileEntity" processor="FileListEntityProcessor" dataSource="null"
recursive = "true"
baseDir="C:/solr-6.6.1/server/solr/core_K2_Depot/Depot" fileName=".*pdf" rootEntity="false">
<field column="file" name="id"/>
<field column="fileLastModified" name="lastmodified" />
<entity name="pdf" processor="TikaEntityProcessor" onError="skip"
url="${K2FileEntity.fileAbsolutePath}" format="text">
<field column="title" name="title" meta="true"/>
<field column="dc:format" name="format" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
Usually, in production you run the DIH process via shell scripts: first copy the needed files from FTP, HTTP, S3, etc., then trigger a full-import or delta-import, and then poll the status command to track the indexing. As soon as it ends successfully, you just execute an rm command, roughly like this (host and core name assumed):
# poll the DIH status endpoint until the import is no longer running
while curl -s "http://localhost:8983/solr/core_K2_Depot/dataimport?command=status&wt=json" | grep -q '"status":"busy"'; do
  sleep 10
done
# the import has finished, so the source files are no longer needed for indexing
rm -rf C:/solr-6.6.1/server/solr/core_K2_Depot/Depot/*
There is no built-in support in Solr for deleting external files.
TLDR
How do I configure the Solr Data Import Handler so it will import HTML similar to Solr's "post" utility?
Context
We're doing a small project where code will export a set of pages from wiki/Confluence to 'straight html' (for availability in a DR data center; straight HTML pages will not depend on a database, etc.).
We want to index the HTML pages in Solr.
We "have it working" using the Solr-shipped "post" utility:
post -c OPERATIONS -recursive -0 -host solr $(find . -name '*.html')
This is fine. However, we would like to leverage the Data Import Handler (DIH), i.e. replace the shell command with a single HTTP call to the DIH endpoint ('/dataimport').
Question
How do I configure the Tika "data config xml" file to get "similar functionality" to the Solr "post" command?
When I configure with data-config.xml, the Solr document only ends up with "id" and "_version_" fields (i.e. where id is the untokenized file name).
Correction: I had originally written '"id" and "title" fields...'
"id":"database_operations_2019.html",
"_version_":1650836000296927232},
However, when I use "bin/post", the document has these fields, i.e. including the tokenized title:
"id":"/usr/local/html/OPERATIONS_2019_1119_1500/./database_operations_2019.html",
"stream_size":[54115],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"dc_title":["Database Operations 2019 Guidebook"],
"content_encoding":["UTF-8"],
"content_type_hint":["text/html; charset=UTF-8"],
"resourcename":["/usr/local/html/OPERATIONS_2019_1119_1500/./database_operations_2019.html"],
"title":["Database Operations 2019 Guidebook"],
"content_type":["text/html; charset=UTF-8"],
"_version_":1650834641083432960},
Some Points
I've tried reading the manual, but I do not follow how "field" maps to the "html body".
Parsing a directory full of HTML is a circa-1999 problem, so I don't expect a lot of people to be hitting this.
I've looked at SimplePostTool.java (the implementation of bin/post)... no real answer there.
Data Config Xml File
<dataConfig>
<dataSource type="BinFileDataSource"/>
<document>
<entity name="file" processor="FileListEntityProcessor"
dataSource="null"
htmlMapper="true"
format="html"
baseDir="/usr/local/var/www/confluence/OPERATIONS"
fileName=".*html"
rootEntity="false">
<field column="file" name="id"/>
<entity name="html" processor="TikaEntityProcessor"
url="${file.fileAbsolutePath}" format="text">
<field column="title" name="title" meta="true"/>
<field column="dc:format" name="format" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
I ended up writing a few lines of code to parse the html files (jsoup) and ditched the solr data import handler (DIH).
Very straightforward using Spring, Solr, and the jsoup HTML parser.
One caveat: my Java "bean" object that stores the Solr fields needed a "text" field for the out-of-the-box default search field to work (i.e. with the Solr Docker instance).
Solr version :: 6.6.1
I am new to Apache Solr and am currently exploring how to use this technology to search within PDF files.
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor
I am able to index the PDF files using the "BinFileDataSource" when the files are on the same server, as shown in the example below.
Now I want to know if there is a way to change the baseDir to point to a folder on a different server.
Please suggest an example of accessing the PDF files from another server. How should I write the path in the baseDir attribute?
<dataConfig>
<dataSource type="BinFileDataSource"/> <!--Local filesystem-->
<document>
<entity name="K2FileEntity" processor="FileListEntityProcessor" dataSource="null"
recursive = "true"
baseDir="C:/solr-6.6.1/server/solr/core_K2_Depot/Depot" fileName=".*pdf" rootEntity="false">
<field column="file" name="id"/>
<field column="fileLastModified" name="lastmodified" />
<entity name="pdf" processor="TikaEntityProcessor" onError="skip"
url="${K2FileEntity.fileAbsolutePath}" format="text">
<field column="title" name="title" meta="true"/>
<field column="dc:format" name="format" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
I finally found the answer from the solr-user mailing list.
Just change the baseDir to the folder on the other server (SMB paths work directly):
baseDir="\\CLDServer2\RemoteK2Depot"
I'm having issues figuring out exactly how to import blob data from a SQL Server database into Solr.
This is hooked into NAV as well. I've managed to get the data out of the table within NAV; however, I need this data in Solr for search purposes.
Here's my current dataConfig file.
<dataConfig>
<dataSource name="dastream" type="FieldStreamDataSource" />
<dataSource name="db" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://localhost;databaseName=TestingDB" user="sa" password="*******" />
<document name="items">
<entity name="item" query="select [No_], [Desc_ English] as desceng from [Foo$Item]" dataSource="db">
<field column="No_" name="id" />
<entity processor="TikaEntityProcessor" url="desceng" dataField="item.desceng" name="blob" dataSource="dastream" format="text" >
<field column="text" name="desceng" />
</entity>
</entity>
</document>
</dataConfig>
The error I keep getting is:
Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: unsupported type : class java.lang.String
I'm not sure what I'm missing.
Maybe this is because NAV stores blobs in its own way. See this question; there is an example of how to extract the data using Python.
I need to search inside the file contents, for which I am using the Solr Data Import Handler. The response should show the content line where the search word appears. So, to process the files line by line, I am using the LineEntityProcessor. My data-config file is:
<dataConfig>
<dataSource type="BinFileDataSource" name = "fds"/>
<document>
<entity name="filelist" processor="FileListEntityProcessor" fileName="sample.docx"
rootEntity="false" baseDir="C:\SampleDocuments" >
<entity name="fileline" processor="LineEntityProcessor"
url="${filelist.fileAbsolutePath}" format="text">
<field column="linecontent" name="rawLine"/>
</entity>
</entity>
</document>
</dataConfig>
The schema.xml has an entry for rawLine:
<field name="rawLine" type="text" indexed="true" stored="true"/>
But when I run the command for a full-import, it throws an exception:
DataImportHandlerException:java.lang.ClassCastException: java.io.FileInputStream cannot be cast to java.io.Reader
Please help me with this, as I have spent a few days on this problem.
BinFileDataSource works with an InputStream, while FileDataSource returns a Reader, which is what LineEntityProcessor expects (hence the ClassCastException).
You can try using FileDataSource instead to fix the casting issue:
<dataSource type="FileDataSource" name = "fds"/>
I looked up the information provided in a related question to set up an import of all the documents that are stored within a MySQL database.
You can find the original question here.
Thanks to the steps provided, I was able to make it work with my MySQL DB. My config looks identical to the one mentioned at the above link.
<dataConfig>
<dataSource name="db"
jndiName="java:jboss/datasources/somename"
type="JdbcDataSource"
convertType="false" />
<dataSource name="dastream" type="FieldStreamDataSource" />
<dataSource name="dareader" type="FieldReaderDataSource" />
<document name="docs">
<entity name="doc" query="select * from document" dataSource="db">
<field name="id" column="id" />
<field name="name" column="descShort" />
<entity name="comment"
transformer="HTMLStripTransformer" dataSource="db"
query="select id, body, subject from comment where iddoc='${doc.id}'">
<field name="idComm" column="id" />
<field name="detail" column="body" stripHTML="true" />
<field name="subject" column="subject" />
</entity>
<entity name="attachments"
query="select id, attName, attContent, attContentType from Attachment where iddoc='${doc.id}'"
dataSource="db">
<field name="attachment_name" column="attName" />
<field name="idAttachment" column="id" />
<field name="attContentType" column="attContentType" />
<entity name="attachment"
dataSource="dastream"
processor="TikaEntityProcessor"
url="attContent"
dataField="attachments.attContent"
format="text"
onError="continue">
<field column="text" name="attachment_detail" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
I have a variety of attachments in the DB, such as JPEG, PDF, Excel, Word, and plain text. Everything works great for most of the binary data (JPEG, PDF, Word and such), but the import fails for certain files. It appears that the data source is set up to throw an exception when it encounters a String instead of an InputStream. I set the onError="continue" flag on the "attachment" entity to ensure that the data import went through despite this error, and I noticed that this problem has happened for a number of files. The exception is given below. Ideas?
Exception in entity : attachment:java.lang.RuntimeException: unsupported type : class java.lang.String
at org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:89)
at org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:48)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:103)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)
I know this is an outdated question, but it appears to me that this exception is thrown when the BLOB (I work with Oracle) is null. When I add a where clause like "blob_column is not null", the problem disappears for me (Solr 4.10.1).
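Applied to the configuration in the question, that suggestion would mean filtering null blobs out of the attachments query; a minimal sketch (column names taken from the question, everything else unchanged):
<entity name="attachments"
        dataSource="db"
        query="select id, attName, attContent, attContentType from Attachment
               where iddoc='${doc.id}' and attContent is not null">
  <!-- the field mappings and the nested TikaEntityProcessor entity stay exactly as in the config above -->
</entity>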