I'm having trouble figuring out exactly how to import BLOB data from a SQL Server database into Solr.
The database is also hooked into NAV. I've managed to get the data out of the table within NAV; however, I need this data in Solr for search purposes.
Here's my current dataConfig file.
<dataConfig>
  <dataSource name="dastream" type="FieldStreamDataSource" />
  <dataSource name="db"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=TestingDB"
              user="sa"
              password="*******" />
  <document name="items">
    <entity name="item" query="select [No_], [Desc_ English] as desceng from [Foo$Item]" dataSource="db">
      <field column="No_" name="id" />
      <entity name="blob" processor="TikaEntityProcessor" url="desceng"
              dataField="item.desceng" dataSource="dastream" format="text">
        <field column="text" name="desceng" />
      </entity>
    </entity>
  </document>
</dataConfig>
The error I keep getting is:
Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: unsupported type : class java.lang.String
I'm not sure what I'm missing.
Maybe this is because NAV stores BLOBs in its own way. See this question; it has an example of how to extract the data using Python.
Related
Solr version: 6.6.1
I am new to Apache Solr and am currently exploring how to use it to search within PDF files.
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor
I am able to index PDF files using the "BinFileDataSource" when the PDF files are on the same server, as shown in the example below.
Now I want to know if there is a way to change the baseDir to point to a folder on a different server.
Please suggest an example of how to access the PDF files from another server. How should I write the path in the baseDir attribute?
<dataConfig>
  <dataSource type="BinFileDataSource"/> <!-- Local filesystem -->
  <document>
    <entity name="K2FileEntity" processor="FileListEntityProcessor" dataSource="null"
            recursive="true"
            baseDir="C:/solr-6.6.1/server/solr/core_K2_Depot/Depot" fileName=".*pdf" rootEntity="false">
      <field column="file" name="id"/>
      <field column="fileLastModified" name="lastmodified" />
      <entity name="pdf" processor="TikaEntityProcessor" onError="skip"
              url="${K2FileEntity.fileAbsolutePath}" format="text">
        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
I finally found the answer from the solr-user mailing list.
Just change the baseDir to the folder on the other server (SMB/UNC paths work directly):
baseDir="\\CLDServer2\RemoteK2Depot"
My table has a column containing a reference URL to a file, along with other columns; a sample table is shown below. I'm trying to index the table along with the file content in Solr. The files are accessible via URL with an 'http://domain.com/' prefix, e.g. 'http://domain.com/file/sample1.pdf', and I will not be able to access these files as file shares.
Filepath           Author   Title
file/sample1.pdf   Jack     title 1
file/sample2.pdf   Bob      title 2
file/sample3.docx  Tim      title 2
My data-import config XML is something like this:
<dataConfig>
  <dataSource name="dbrows" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:#.....
              user="***"
              password="***"/>
  <dataSource type="BinFileDataSource" name="attachments" />
  <document>
    <entity name="docs" dataSource="dbrows" query="select 'http://domain.com/'||filepath as PATH, author, title from dummytable">
      <entity name="file"
              processor="TikaEntityProcessor"
              url="${docs.PATH}"
              dataSource="attachments"
              format="text"
              onError="continue"
              transformer="script:processFile">
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
The error I'm getting is:
2015-10-13 23:15:43.859 WARN (Thread-25) [ x:db] o.a.s.h.d.FileDataSource FileDataSource.basePath is empty. Resolving to: C:\Users\asdf\Downloads\Solr\solr-5.3.1\server\.
2015-10-13 23:15:43.860 ERROR (Thread-25) [ x:db] o.a.s.h.d.EntityProcessorWrapper Exception in entity : file:java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: http://domain.com/file/sample1.pdf (resolved to: C:\Users\asdf\Downloads\Solr\solr-5.3.1\server\.\http://domain.com/file/sample1.pdf
at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:126)
at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:51)
at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:42)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:131)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
Caused by: java.io.FileNotFoundException: Could not find file: http://domain.com/file/sample1.pdf (resolved to: C:\Users\asdf\Downloads\Solr\solr-5.3.1\server\.\http://domain.com/file/sample1.pdf
at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:122)
... 12 more
2015-10-13 23:15:43.890 WARN (Thread-25) [ x:db] o.a.s.h.d.FileDataSource FileDataSource.basePath is empty. Resolving to: C:\Users\asdf\Downloads\Solr\solr-5.3.1\server\.
Is this even possible? Any help is highly appreciated.
Fixed. I used BinURLDataSource instead of BinFileDataSource. I changed this:
<dataSource type="BinFileDataSource" name="attachments" />
to this:
<dataSource type="BinURLDataSource" name="attachments" />
I need to search inside file contents, for which I am using the Solr Data Import Handler. The response should show the content line where the search word appears, so to process the file line by line I am using LineEntityProcessor. My data-config file is:
<dataConfig>
  <dataSource type="BinFileDataSource" name="fds"/>
  <document>
    <entity name="filelist" processor="FileListEntityProcessor" fileName="sample.docx"
            rootEntity="false" baseDir="C:\SampleDocuments">
      <entity name="fileline" processor="LineEntityProcessor"
              url="${filelist.fileAbsolutePath}" format="text">
        <field column="linecontent" name="rawLine"/>
      </entity>
    </entity>
  </document>
</dataConfig>
The schema.xml has an entry for rawLine:
<field name="rawLine" type="text" indexed="true" stored="true"/>
But when I run the full-import command, it throws an exception:
DataImportHandlerException:java.lang.ClassCastException: java.io.FileInputStream cannot be cast to java.io.Reader
Please help me with this, as I have spent a few days on this problem.
BinFileDataSource works with an InputStream, while FileDataSource works with a Reader; LineEntityProcessor expects a Reader, so it cannot cast the FileInputStream it is given.
You can try using FileDataSource instead to get around the casting issue:
<dataSource type="FileDataSource" name = "fds"/>
In my db-data-config.xml I have configured two data sources, each with its own name, for example:
<dataSource name="test1"
type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/firstdb"
user="username1"
password="psw1"/>
<dataSource name="test2"
type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/seconddb"
user="username2"
password="psw2"/>
<document name="content">
<entity name="news" datasource="test1" query="select...">
<field column="OTYPE_ID" name="otypeID" />
<field column="NWS_ID" name="cntID" />
....
</entity>
<entity name="news_update" datasource="test2" query="select...">
<field column="OTYPE_ID" name="otypeID" />
<field column="NWS_ID" name="cntID" />
....
</entity>
</document>
</dataConfig>
But when I execute the second entity's query from the dataimport screen in Solr, it throws an exception:
"Table 'firstdb.secondTable' doesn't exist\n\tat"
Could someone help me? Thank you in advance.
I think that your query for news_update is wrong. You must have an error in the table name.
I'm pretty sure this question showed up on the solr-user mailing list. The answer given there was that you are using datasource in your entity tags instead of dataSource. It's case sensitive. If I recall the thread correctly, changing this solved your problem.
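To spell out that fix, the entity tags would look like this (same names as in the question); with the misspelled attribute the entity is never pointed at test2, which matches the error showing the query running against firstdb:

<entity name="news" dataSource="test1" query="select...">
  ...
</entity>
<entity name="news_update" dataSource="test2" query="select...">
  ...
</entity>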
I looked up the information provided in a related question to set up an import of all documents that are stored within a MySQL database.
You can find the original question here.
Thanks to the steps provided there, I was able to make this work with my MySQL DB. My config looks identical to the one mentioned at the above link.
<dataConfig>
<dataSource name="db"
jndiName="java:jboss/datasources/somename"
type="JdbcDataSource"
convertType="false" />
<dataSource name="dastream" type="FieldStreamDataSource" />
<dataSource name="dareader" type="FieldReaderDataSource" />
<document name="docs">
<entity name="doc" query="select * from document" dataSource="db">
<field name="id" column="id" />
<field name="name" column="descShort" />
<entity name="comment"
transformer="HTMLStripTransformer" dataSource="db"
query="select id, body, subject from comment where iddoc='${doc.id}'">
<field name="idComm" column="id" />
<field name="detail" column="body" stripHTML="true" />
<field name="subject" column="subject" />
</entity>
<entity name="attachments"
query="select id, attName, attContent, attContentType from Attachment where iddoc='${doc.id}'"
dataSource="db">
<field name="attachment_name" column="attName" />
<field name="idAttachment" column="id" />
<field name="attContentType" column="attContentType" />
<entity name="attachment"
dataSource="dastream"
processor="TikaEntityProcessor"
url="attContent"
dataField="attachments.attContent"
format="text"
onError="continue">
<field column="text" name="attachment_detail" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
I have a variety of attachments in the DB, such as JPEG, PDF, Excel, Word, and plain text. Everything works great for most of the binary data (JPEG, PDF, Word and such), but the import fails for certain files. It appears that the data source is set up to throw an exception when it encounters a String instead of an InputStream. I set the onError="continue" flag on the "attachment" entity to ensure that the data import went through despite this error, and I noticed that this problem happens for a number of files. The exception is given below. Ideas?
Exception in entity : attachment:java.lang.RuntimeException: unsupported type : class java.lang.String
at org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:89)
at org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:48)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:103) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)
I know this is an outdated question, but:
It appears to me that this exception is thrown when the BLOB (I work with Oracle) is null. When I add a where clause like "blob_column is not null", the problem disappears for me (Solr 4.10.1).
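Applied to the attachments entity from the config above, that would look something like this (a sketch, with attContent playing the role of blob_column):

<entity name="attachments"
        dataSource="db"
        query="select id, attName, attContent, attContentType
               from Attachment
               where iddoc='${doc.id}' and attContent is not null">
  <!-- the field mappings and the nested Tika entity stay as above -->
</entity>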