I want to search Solr for server names in a set of Microsoft Word documents, PDFs, and image files such as JPGs and GIFs.
Server names are given by the regular expression (regex):
INFP[a-zA-Z0-9]{3,9}
TRKP[a-zA-Z0-9]{3,9}
PLCP[a-zA-Z0-9]{3,9}
SQRP[a-zA-Z0-9]{3,9}
....
Problem
I want to get the text in the documents matching the regexes, e.g. INFPWSV01, PLCPLDB01.
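Independently of Solr, the patterns themselves can be checked with a short script. Note the capital Z in `[a-zA-Z0-9]`: the range `A-z` would also match characters like `[`, `\`, and `_`. The sample text and server names below are hypothetical.

```python
import re

# One combined pattern for all server-name prefixes; [a-zA-Z0-9]{3,9}
# matches 3 to 9 alphanumeric characters after the prefix.
SERVER_NAME = re.compile(r"\b(?:INFP|TRKP|PLCP|SQRP)[a-zA-Z0-9]{3,9}\b")

text = "Deployed on INFPWSV01, backed by PLCPLDB01; ignore INFP12 (too short)."
print(SERVER_NAME.findall(text))  # → ['INFPWSV01', 'PLCPLDB01']
```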
I've indexed the files using Solr/Tika/Tesseract with the default schema.
I've used the highlighting search tool with
hl ticked
hl.usePhraseHighlighter ticked
but Solr only returns metadata (presumably), such as the filename of the file containing the pattern(s).
Questions
Would I have to modify the managed schema?
If so, would I have to store the file content in the index?
If so, is this the way to do it:
a. solrconfig.xml <- inside my "core"
b. Remove line
as I want metadata
c. In the managed schema, change
stored="false" to stored="true"
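For illustration, a field definition with stored="true" would look like the sketch below. The field name and type here are hypothetical; the actual field that holds the extracted body text depends on your schema (often something like content or _text_), so check your own managed schema first.

```xml
<!-- Hypothetical example: the real field name depends on your schema.
     stored="true" is required for highlighting to return text snippets. -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

Note that after changing stored, documents must be re-indexed for the stored content to be available.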
I am building a search engine with Solr 4.8.1 and, in doing so, am attempting to display the file name of each indexed document in my GUI search results.
I can successfully display any field that is in Solr's schema.xml file (title, author, id, resourcename, last_modified, etc.). I cannot, however, find a field in schema.xml that holds the name of the file (e.g. for the file Test.pdf the name "Test", or for Example.docx the word "Example").
The closest field I can find is "resourcename", which displays the entire file path on my system (e.g. C:\Users\myusername\Documents\solr-4.8.1\example\exampledocs\filename.docx, when all I want to display is filename.docx).
(1) How do I tell Solr to index the name of a file?
or
(2) Is there a field that covers the file name that I am just missing?
Sincerest thanks!
---Research Update---
It seems this question is asking for the same thing - Solr return file name - however, I do not believe that simply adding a field called "filename" will cause Solr to index the file name! I know I need to add a field to the Schema.xml file - now how do I point that field to the name of a file?
This is not so much a question about Solr functionality as about the tools you use to publish to Solr. While adding a new field called fileName to Solr will resolve part of the issue, you will also need to modify the publishing tool to add the testPDF.pdf value to each document. I'd point my eyes at Tika: http://tika.apache.org/, seeing how you mention both PDF and DOC files.
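One common approach (a sketch, not specific to Tika) is to derive the bare file name from the full path at publish time and send it alongside the document as an extra field. Deriving the name itself is trivial; the helper below is hypothetical:

```python
import ntpath  # handles Windows-style backslash paths even when run on Unix


def file_name(path: str) -> str:
    """Return just the file-name portion of a full path."""
    return ntpath.basename(path)


print(file_name(r"C:\Users\myusername\Documents\solr-4.8.1\example\exampledocs\filename.docx"))
# → filename.docx
```

The resulting value can then be posted to Solr as the new fileName field (for the extracting request handler, a literal.fileName request parameter is the usual mechanism).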
I have product data in Solr, which Solr already returns as XML via queries.
However, I need the data in a different XML format (just the names of the XML nodes are different) to supply as a feed to another application.
Any idea how I can do this quickly from Solr?
I finally managed to do this using one of the existing response writers, which does not require writing a new one. Here is how I did it.
I have used XSLTResponseWriter to generate custom format xml. You can find more details here: http://wiki.apache.org/solr/XsltResponseWriter
You can find more information on how to use response writer here: https://wiki.apache.org/solr/QueryResponseWriter
Okay, now before you use it, it needs to be configured as follows.
Step 1: Define QueryResponseWriter for XSLT in your solrconfig.xml
Add the following code to your solrconfig.xml, after the query component.
<!--
Changes to XSLT transforms are taken into account
every xsltCacheLifetimeSeconds at most.
-->
<queryResponseWriter name="xslt" class="org.apache.solr.response.XSLTResponseWriter">
<int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>
Step 2: Use a proper XSLT format or customize your own
You can either use one of the existing XSLT formats provided in the default Solr download or modify one to work the way you want. Five example formats are provided already. Suppose you use example.xsl, which generates an HTML document with all fields; then you would query it as shown in step 3.
I customized it, using a custom.xsl file, to implement my own format. I'll come to that later.
Step 3: Query your Solr using XSLT
http://localhost:8983/solr/mysolrcore/select?q=*:*&wt=xslt&tr=default.xsl&rows=10
This will query Solr and present the data in the format defined in default.xsl. Note the wt and tr parameters. The rows parameter controls how many records are returned.
Custom XML format using XSLT
Here's how I produced my custom XML format using XSLT. I hope this is helpful to someone.
I used example_rss.xsl as a base and modified it as follows.
<?xml version='1.0' encoding='UTF-8'?>
<!--
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
-->
<!--
Sample transform of Solr query results to custom XML format
-->
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output
method="xml"
encoding="utf-8"
media-type="application/xml"
/>
<xsl:template match='/'>
<xml version="1.0">
<xsl:apply-templates select="response/result/doc"/>
</xml>
</xsl:template>
<!-- search results xslt -->
<xsl:template match="doc">
<xsl:variable name="id" select="str[@name='Id']"/>
<xsl:variable name="timestamp" select="date[@name='timestamp']"/>
<item>
<id><xsl:value-of select="int[@name='Id']"/></id>
<title><xsl:value-of select="str[@name='Name']"/></title>
<link>
http://localhost:8983/solr/mysolrcore/<xsl:value-of select="str[@name='url']"/>p-<xsl:value-of select="int[@name='Id']"/>
</link>
<image>
<xsl:value-of select="str[@name='ImageURL']"/>
</image>
<category><xsl:value-of select="arr[@name='Category']"/></category>
<availability><xsl:value-of select="bool[@name='StockAvailability']"/></availability>
<description>
<xsl:value-of select="str[@name='ShortDescription']"/>
</description>
</item>
</xsl:template>
</xsl:stylesheet>
This generates a valid XML document without the need to write your own custom response writer.
You will have to write your own ResponseWriter.
The best way is to start by looking at an existing implementation, for instance CSVResponseWriter.
A quick look at the code shows that you get a SolrQueryResponse object in the write method. From the response you can get the matched SolrDocuments and other required information.
I know it contains the header and file data in raw format, but does this mean that every time I query the index, the raw data is processed to find the frequency of terms? I ask since I cannot see a .frq file. Is there any way to find out how the data is stored in the .cfs file?
The index file format is compound, hence the .cfs file, which has all the index files combined into one.
Check the File Formats documentation, which gives details of the Lucene index file formats.
You can use Luke to explore your Lucene index files.
We have multiple XML files on Unix that we need to convert into flat files. We did this parsing for one level of XML using C (C was chosen because it can communicate with Teradata FastLoad, our target system, via INMOD, so the whole job completes in a single pass; in other languages we would need two passes, one to convert to a flat file and one to load into Teradata). I.e., the file below
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
</book>
Is converted into
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~44.95~
We achieved this by parsing the file in C. But then we saw the original format of the XML file, which is below (please do not consider it the required file; I am just giving an idea):
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<modified>2010-01-02</modified>
<modified>2010-01-03</modified>
<price>44.95</price>
</book>
It seems this should be converted into two records:
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95~
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95~
But now we feel that our C code is going to become complex for this requirement, so we are looking at other options that can easily be used on Unix. Can anyone please give us working example code in different languages/tools for Unix?
You can use XSLT. I use Saxon (Java), which can be run on Unix.
This stylesheet handles both of your XML samples:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/book">
<xsl:choose>
<xsl:when test="modified">
<xsl:for-each select="modified">
<xsl:call-template name="dump-line">
<xsl:with-param name="pos" select="position()"/>
</xsl:call-template>
</xsl:for-each>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="@id"/><xsl:text>~</xsl:text>
<xsl:value-of select="author"/><xsl:text>~</xsl:text>
<xsl:value-of select="title"/><xsl:text>~</xsl:text>
<xsl:value-of select="genre"/><xsl:text>~</xsl:text>
<xsl:value-of select="price"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template name="dump-line">
<xsl:param name="pos"/>
<xsl:value-of select="/book/@id"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/author"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/title"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/genre"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/modified[$pos]"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/price"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
If there are no modified elements, one record is output. If there are modified elements, it outputs as many records as there are modified elements.
Sample output w/modified elements:
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95
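For comparison, the same flattening can be sketched in Python with the standard library's ElementTree. The field order is hard-coded to match the sample, and the flatten helper is hypothetical, not part of the original C solution:

```python
import xml.etree.ElementTree as ET


def flatten(book_xml: str) -> list[str]:
    """Emit one '~'-separated record per <modified> element (or one record if none)."""
    book = ET.fromstring(book_xml)
    base = [book.get("id")] + [book.findtext(t, "") for t in ("author", "title", "genre")]
    mods = [m.text for m in book.findall("modified")]
    if not mods:
        return ["~".join(base + [book.findtext("price", "")])]
    return ["~".join(base + [m, book.findtext("price", "")]) for m in mods]


xml_doc = """<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <modified>2010-01-02</modified>
  <modified>2010-01-03</modified>
  <price>44.95</price>
</book>"""

for line in flatten(xml_doc):
    print(line)
# bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95
# bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95
```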
If you're loading the data into a database and you have fields that share a many-to-one relationship with other fields, then you need to make sure your database structure is up to scratch, i.e. one table for the book and one table for the modification dates. Otherwise it will look like there are two books when in fact there is one book with two modification dates.
However, if you are loading the data into a database, why are you first converting it to a flat file? You said you wanted to avoid two passes of parsing, but it looks like you'll have one pass to parse the XML and output a flat file, and another to parse the flat file and enter it into the database. Why not simply parse the XML and put the data directly into the database?
There are reasons why formats like XML were invented and one is to encapsulate complicated data relationships in text based documents. By converting to a "flat file" you will lose that complexity. If you are then going to import the data into an environment that can handle that complexity and store those relationships...why not keep it?
Does your database have an API, or can it only import flat files?
---EDIT---
It's easier to reply as part of an answer than as a series of comments.
First, thanks for the clarification.
Second, no I cannot provide example code. Mostly since what you want sounds very specific.
Thirdly, I think you have two options:
1) You have a load of C code already written to parse the XML. You have to consider the cost of throwing it all away and writing it again in Perl and supporting that, against the cost of improving it to import data directly into your Teradata database and the cost of maintaining it thereafter.
2) For Perl, there are many XML parsers and in my experience they make traversing an XML tree/data structure much much easier than in C. I'm not a fan of Perl, but I have written code to deal with ready parsed XML trees in C and I have never failed to hate it. By contrast, doing it in Perl is simpler and probably even quicker.
There are a huge number of Perl modules out there to parse XML. I suggest you search the internet for some reviews on them to decide which is easiest or most appropriate for you to use.
There is a Perl module called Teradata::SQL that should allow you to import the data into your Teradata database. There may be other modules that are easier/simpler/better to use; I have no experience with any of them, so I cannot make a recommendation. Search http://www.cpan.org for any modules that may be useful.
Lastly, I STRONGLY recommend taking some time to ensure that the design of your Teradata database matches the data going into it. As I stated above, you clearly have a many-to-one relationship between modification dates and books, which means you need a table for modification dates, a table for books, and correct many-to-one relationships in your table design. Putting one entry per line, resulting in multiple lines for the same book with only the modification date varying, is very wrong. There may be other many-to-one relationships, such as author. Imagine book B written by authors A1 and A2, with modification dates M1 and M2. If you use the approach you discussed above, with one line for each combination, you end up with four entries for the same book, and it looks like you have two books with the same title written by different authors.
Spend some time to ensure you understand the structure of the data in the XML files. This should be clearly defined by the DTD.
XSLT is an option; check out the xsltproc tool.
Or you can use the much easier XQuery, though you might need to coerce it into producing text. The following XQuery script does almost what you want (only a few fields are listed):
for $book in doc("book.xml")/book
for $mod in $book/modified
return concat($book/#id, "~", $book/title, "~", $mod, "
")
You can run this through Saxon with
java net.sf.saxon.Query '!method=text' script.xq
Another popular XQuery processor for Unix is XQilla, though I'm not sure it can produce non-XML output.
(There may be a smart alternative to my awkward way of generating a newline.)
How about formatting the line as bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02,2010-01-03~44.95~? Of course, special consideration must be given to the fact that the modified field can contain a list of values. That's about as flat as you can make it.
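That variant, with the multi-valued modified field collapsed into a comma-separated list, can be sketched like this (hypothetical helper, standard library only):

```python
import xml.etree.ElementTree as ET


def flatten_one_line(book_xml: str) -> str:
    """One record per book; multi-valued <modified> collapsed into a comma list."""
    book = ET.fromstring(book_xml)
    modified = ",".join(m.text for m in book.findall("modified"))
    fields = [book.get("id"), book.findtext("author", ""), book.findtext("title", ""),
              book.findtext("genre", ""), modified, book.findtext("price", "")]
    return "~".join(fields) + "~"


sample = ("<book id='bk101'><author>Gambardella, Matthew</author>"
          "<title>XML Developer's Guide</title><genre>Computer</genre>"
          "<modified>2010-01-02</modified><modified>2010-01-03</modified>"
          "<price>44.95</price></book>")
print(flatten_one_line(sample))
# → bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02,2010-01-03~44.95~
```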