I have some doc files in D:/temp/docs on my local machine and I want to index them using Apache Solr and Tika. The following is my data-config.xml file:
<dataSource type="BinFileDataSource" />
<document>
<entity name="file_Import" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="D:/temp/docs" fileName=".*\.(doc)|(pdf)|(docx)"
onError="skip"
recursive="true">
<field column="fileAbsolutePath" name="id" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastModified" />
<entity
name="documentImport"
processor="TikaEntityProcessor"
url="${files.fileAbsolutePath}"
format="text">
<field column="file" name="fileName"/>
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
When I try to import those files into Solr I get the following exception:
Caused by: java.net.MalformedURLException: no protocol: null
at java.net.URL.<init>(Unknown Source)
at java.net.URL.<init>(Unknown Source)
at java.net.URL.<init>(Unknown Source)
at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
... 11 more
I figured out that Solr is not able to locate the D:/temp/docs folder.
I don't know how to resolve this. Any help is appreciated.
Resolved ...
I had more than one dataSource tag in my data-config.xml, one of which was <dataSource type="URLDataSource" /> and was causing the problem. So I removed all the dataSources, kept only <dataSource type="BinFileDataSource" />, and it worked ... :)
Check the URL for the dataSource baseDir.
Try changing from
baseDir="D:/temp/docs"
to
baseDir="D:/temp/docs/"
and change fileName to ".*" to index all files in that folder, as in the sketch below.
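For reference, a minimal corrected data-config.xml along those lines. This is a sketch assuming the schema fields from the question; it keeps a single BinFileDataSource and also fixes the inner entity's variable to reference the outer entity's name (file_Import, not files):
<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <entity name="file_Import" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="D:/temp/docs/" fileName=".*"
                onError="skip" recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />
            <!-- the ${...} prefix must match the outer entity's name -->
            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${file_Import.fileAbsolutePath}" format="text">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>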
Related
How can I index files over FTP? The FTP repo contains all my documents in different formats. I am able to do this for a local folder, but it doesn't work with FTP.
I have this configuration via DIH:
<dataConfig>
    <dataSource type="BinFileDataSource" />
    <dataSource type="BinURLDataSource" name="binSource" baseUrl="ftp://localhost:21/"
                onError="skip" user="solr_ftp" password="solr_ftp_pass" />
    <document>
        <!-- baseDir: path to the folder that contains the files (pdf | doc | docx | ...) -->
        <entity name="files" dataSource="binSource" baseDir="ftp://localhost"
                rootEntity="false" processor="FileListEntityProcessor"
                fileName=".*\.(doc)|(pdf)|(docx)|(txt)|(rtf)|(html)|(htm)"
                onError="skip" recursive="true">
            <field column="fileAbsolutePath" name="filePath" />
            <field column="resourceName" name="resourceName" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />
            <!-- tika -->
            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text">
                <field column="title" name="title" meta="true"/>
                <field column="subject" name="subject" meta="true"/>
                <field column="description" name="description" meta="true"/>
                <field column="comments" name="comments" meta="true"/>
                <field column="Author" name="author" meta="true"/>
                <field column="Keywords" name="keywords" meta="true"/>
                <field column="category" name="category" meta="true"/>
                <field column="xmpTPg:NPages" name="Page-Count" meta="true"/>
                <field column="text" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>
Error:
failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: ftp://localhost is not a directory Processing Document # 1
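That error is expected: FileListEntityProcessor only walks a local filesystem directory, so baseDir cannot be an ftp:// URL. Two workarounds: list the file URLs from a database and fetch them with BinURLDataSource (as in the next question below), or mount/sync the FTP folder locally and index it from disk. A minimal sketch of the second option, assuming a hypothetical local mount point /mnt/ftp_docs:
<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="/mnt/ftp_docs" fileName=".*\.(doc|docx|pdf|txt|rtf|html?)"
                onError="skip" recursive="true">
            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text">
                <field column="title" name="title" meta="true"/>
                <field column="text" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>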
I ran into some trouble when I tried to get files from an FTP server to extract metadata with the TikaEntityProcessor.
I need a way to pass credentials to the URLDataSource.
Can anyone, please, tell me how I can do that?
Example values:
url:
ftp://localhost/Oreilly.Mercurial.The.Definitive.Guide.Jun.2009.pdf
ftp user: alex
ftp password: pass
This is my data-config.xml:
<dataConfig>
<dataSource type="BinURLDataSource" name="binSource"
baseUrl="ftp://localhost:21/" onError="skip" />
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://localhost:5432/files"
user="postgres"
password="admin"
readOnly="true"
autoCommit="false"
transactionIsolation="TRANSACTION_READ_COMMITTED"
holdability="CLOSE_CURSORS_AT_COMMIT"/>
<document>
<entity name="item" query="select* from filesfromftp"
deltaQuery="select url from filesfromftp"
rootEntity="false"
transformer="RegexTransformer">
<field column="url" name="id" />
<entity name="tika-test"
processor="TikaEntityProcessor"
url="${item.url}"
format="none"
dataSource="binSource"
onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="pdf:docinfo:title" name="title" meta="true"/>
<field column="xmpTPg:NPages" name="numPages" meta="true"/>
<field column="Creation-Date" name="createdDate" meta="true"/>
</entity>
</entity>
</document>
</dataConfig>
When I execute Data Import Handler I get this error:
Exception in entity : tika-test:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url ftp://localhost/jnioche-bristoljavameetup20150310-150311041443-conversion-gate01.pdf Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:89)
at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:516)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:475)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:458)
at java.lang.Thread.run(Thread.java:745)
Caused by: sun.net.ftp.FtpLoginException: Invalid username/password
at sun.net.www.protocol.ftp.FtpURLConnection.connect(FtpURLConnection.java:308)
at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:393)
at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:86)
... 12 more
Please, how can I establish a connection with an FTP server within the Solr DIH?
Is there a way to pass credentials to the URLDataSource?
There is a patch available here for this purpose. It is very old, but you could port it to a newer version. Look at a fairly recent comment showing how you can create your own custom URLDataSource with auth.
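Two untested sketches, neither taken from the patch itself. First, for FTP specifically, Java's built-in ftp:// URL handler accepts credentials embedded in the URL, so storing URLs of the form ftp://alex:pass@localhost/file.pdf in the filesfromftp table may be enough with no custom code at all. Second, a minimal custom DataSource that injects the credentials itself; the class name, package, and property handling are hypothetical, while the init/getData/close methods are the real DIH DataSource contract:
// Hypothetical class; register it in data-config.xml as
// <dataSource name="binSource" type="com.example.dih.FtpAuthDataSource" user="alex" password="pass" />
package com.example.dih;

import java.io.InputStream;
import java.net.URL;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;

public class FtpAuthDataSource extends DataSource<InputStream> {
    private String user;
    private String password;

    @Override
    public void init(Context context, Properties initProps) {
        // credentials come from the dataSource attributes in data-config.xml
        user = initProps.getProperty("user");
        password = initProps.getProperty("password");
    }

    @Override
    public InputStream getData(String query) {
        try {
            // rewrite ftp://host/... to ftp://user:password@host/... so that
            // Java's FTP URL handler logs in with the embedded userinfo
            String url = query;
            if (url.startsWith("ftp://") && !url.contains("@")) {
                url = "ftp://" + user + ":" + password + "@" + url.substring("ftp://".length());
            }
            return new URL(url).openConnection().getInputStream();
        } catch (Exception e) {
            throw new RuntimeException("Unable to open " + query, e);
        }
    }

    @Override
    public void close() {
        // nothing to release; each getData call returns an independent stream
    }
}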
I'm successfully able to index pdf, doc, ppt, etc. files using the Data Import Handler in Solr 4.3.0.
My data-config.xml looks like this -
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="C:\Users\aroraarc\Desktop\Impdo"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)|(pptx)|(xls)|(xlsx)|(txt)" onError="skip"
recursive="true">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastmodified" />
<field column="file" name="fileName"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="content"/>
</entity>
</entity>
</document>
</dataConfig>
However, in the fileName field I want to insert the bare file name without the extension, e.g. instead of 'HelloWorld.txt' I want only 'HelloWorld' to be inserted in the fileName field. How do I achieve this?
Thanks in advance!
Check the ScriptTransformer to replace or change the value before it is indexed.
Example -
Data config - add a script function that strips the extension; the existing <field column="file" name="fileName"/> mapping then picks up the stripped value:
<script><![CDATA[
function changeFileName(row) {
    var fileName = row.get('file');
    if (fileName != null) {
        // strip the extension from the last '.', e.g. 'HelloWorld.txt' -> 'HelloWorld'
        var idx = fileName.lastIndexOf('.');
        if (idx > 0) {
            row.put('file', fileName.substring(0, idx));
        }
    }
    return row;
}
]]></script>
Entity mapping -
<entity name="f" transformer="script:changeFileName" ....>
......
</entity>
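For completeness, the <script> block goes directly under <dataConfig>, next to the dataSource and document elements. A skeleton using the names from this question:
<dataConfig>
    <script><![CDATA[
        // changeFileName as defined above
    ]]></script>
    <dataSource name="bin" type="BinFileDataSource" />
    <document>
        <entity name="f" transformer="script:changeFileName"
                processor="FileListEntityProcessor" ...>
            <!-- field mappings as before -->
        </entity>
    </document>
</dataConfig>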
I am trying to scan all pdf/doc files in a directory. This works fine and I am able to scan all documents.
The next thing I am trying to do is also return the filename of the file in the search results. The filename, however, never shows up. I tried a couple of things, but the documentation is not very helpful about how to do this.
I am using the Solr configuration found in the Solr distribution: apache-solr-3.1.0/example/example-DIH/solr/tika/conf
This is my dataConfig:
<dataConfig>
<dataSource type="BinFileDataSource" name="bin"/>
<document>
<entity name="f" processor="FileListEntityProcessor" recursive="true"
rootEntity="false" dataSource="null" baseDir="C:/solrtestsmall"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)" onError="skip">
<entity name="tika-test" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" dataSource="bin"
onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
<field column="fileName" name="fileName"/>
</entity>
</document>
</dataConfig>
I am interested in how to configure this correctly, and also in any other places where I can find specific documentation.
You should use file instead of fileName as the column name:
<field column="file" name="fileName"/>
Don't forget to add the 'fileName' field to schema.xml in the fields section:
<field name="fileName" type="string" indexed="true" stored="true" />
There is a problem when I use Solr 1.3 delta-imports to update the index. I have added a last_modified column to the table. After I run the full-import command to index the database data, the dataimport.properties file contains nothing, and when I run the delta-import command to update the index, Solr lists all the data in the database, not just the latest data. My db-data-config.xml:
deltaQuery="select shop_id from shop where last_modified > '${dataimporter.last_index_time}'">
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/funguide" user="root" password="root"/>
<document name="shopinfo">
<entity name="shop" pk="shop_id"
query="select shop_id,title,description,tel,address,longitude,latitude from shop"
<field column="shop_id" name="id" />
<field column="title" name="title" />
<field column="description" name="description" />
<field column="tel" name="tel" />
<field column="address" name="address" />
<field column="longitude" name="longitude" />
<field column="latitude" name="latitude" />
</entity>
</document>
</dataConfig>
Does anybody know how to solve this problem? Thanks!
I would also recommend upgrading to the Solr 1.4 RC, as quite a few improvements have been made to delta-imports in the DataImportHandler. Please see the DataImportHandler wiki page, section "Using delta-import command", for specifics.
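With 1.4 the entity can pair deltaQuery with a deltaImportQuery, so the delta run fetches only the changed rows. A sketch based on the entity above, using the standard ${dataimporter.delta.<pk>} variable:
<entity name="shop" pk="shop_id"
        query="select shop_id,title,description,tel,address,longitude,latitude from shop"
        deltaQuery="select shop_id from shop where last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select shop_id,title,description,tel,address,longitude,latitude from shop where shop_id = '${dataimporter.delta.shop_id}'">
    <!-- field mappings as in the original entity -->
</entity>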