I'm trying to get started with Apache Solr, but some things are not clear to me. Reading through the tutorial, I've set up a running Solr instance. What I find confusing is that all the configuration of Solr (schemas and so on) is in XML format. When they add sample data, it shows how to add XML documents (java -jar post.jar solr.xml monitor.xml). Is it just a bad choice of sample format? I mean, are they uploading data describing documents, or are the actual documents they're adding .xml files?
I'm trying to add some books in .txt format, so if I use java -jar post.jar mydoc.txt, am I adding it? How could I add this document and metadata (author, title) about it?
That said, I tried to set up a simple HTML page to post documents to Solr:
<html>
<head></head>
<body>
<form action="http://localhost:8983/solr/update?commit=true" enctype="multipart/form-data" method="post">
<input type="file">
<input type="submit" value="Send">
</form>
</body>
</html>
When I try to post a file, I get this response:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">26</int>
</lst>
</response>
Is this correct? Does it mean that I've successfully added my file? If so, one of the words in the file is, for example, "montagna" (this is an Italian book; montagna means mountain). If I visit the URL
http://localhost:8983/solr/select/?q=montagna&start=0&rows=10&indent=on
I expect something to be returned (the whole text maybe, or some info about the file), but this is what I get:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">montagna</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>
Doesn't seem like a match to me. Also, according to this answer, I should be able to get back the text surrounding the matches with hl.fragsize. How do I integrate this in the search string? Thank you
The Solr example adds documents to the index through XML messages. Have a look here. The *.xml files you mentioned are XML messages stored on the file system. Those XML messages look like this:
<add>
<doc>
<field name="id">UTF8TEST</field>
<field name="name">Test with some UTF-8 encoded characters</field>
<field name="manu">Apache Software Foundation</field>
<field name="cat">software</field>
<field name="cat">search</field>
<field name="features">No accents here</field>
<field name="price">0</field>
<!-- no popularity, get the default from schema.xml -->
<field name="inStock">true</field>
</doc>
</add>
It's just a way to represent any kind of document to index. Every document contains one or more fields. There are different ways to add documents to Solr (for example, it also accepts CSV), but XML is currently the most common format.
I think you aren't actually indexing anything. You can check the output of this query: http://localhost:8983/solr/select/?q=*:* which retrieves all the documents you have in your index. A common error is also forgetting to commit, but I saw you added the commit=true parameter to your url, so that's not your case.
If you want to index just the content of a text file, you could for example define your schema with two fields:
filename
content
and use this message to index your document:
<add>
<doc>
<field name="filename">test.txt</field>
<field name="content">Test with some UTF-8 encoded characters</field>
</doc>
</add>
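As a rough sketch of how such a message could be generated for a plain-text file, the Python snippet below builds the same `<add><doc>` structure with the standard library. The helper name `build_add_message` is hypothetical, and the `filename` and `content` fields are assumed to exist in your schema as described above.

```python
import xml.etree.ElementTree as ET


def build_add_message(filename: str, content: str) -> str:
    """Return a Solr <add> update message for a single document.

    Assumes the schema defines the fields 'filename' and 'content'.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("filename", filename), ("content", content)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_message("test.txt", "La montagna & il mare"))
```

The resulting string would then be sent as the body of a POST to the update handler (the same /solr/update?commit=true URL used in the question), rather than uploading the raw .txt file itself.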
Do understand the terminology:
Document in solr -> Row in RDBMS
Field of document -> Column (a cell of the row)
And a Solr core is of course, both database and gigantic table, occupied in a (potentially) sparse manner.
For your (particular) use, you would create a document for each file; composed of an ID, file content etc.
XML is one way of composing solr operations. http://wiki.apache.org/solr/UpdateXmlMessages
It has the add, delete, commit and optimize operations. The add operation includes one or more documents.
<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>
There are also CSV (add functionality only), JSON (full functionality), DIH (scheduled database imports).
There is also the extracting request handler, which can extract content (and metadata) from all kinds of rich documents (DOC, DOCX, PDF). Additionally, its literal.<fieldname> parameters let you set your own fields.
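To make the literal idea a bit more concrete, here is a minimal Python sketch that only builds the request URL. It assumes the extracting request handler is mounted at /update/extract (as in the Solr example configuration) and that your schema has `id`, `author` and `title` fields; the helper name `build_extract_url` is made up for illustration.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, doc_id: str, **metadata: str) -> str:
    """Build an /update/extract URL that sets fields via literal.<fieldname>.

    The handler path and the field names are assumptions about your setup.
    """
    params = {"literal.id": doc_id, "commit": "true"}
    for name, value in metadata.items():
        params[f"literal.{name}"] = value
    return f"{base_url}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extract_url("http://localhost:8983/solr", "book1",
                            author="Dante Alighieri"))
```

The file itself would be sent as the body of the POST to that URL; the literal parameters only carry the metadata.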
The extracting request handler stores its output into the field text. The query parser q= and the highlighter assume a default field (yes, it's pertinent to what you did) of text. You can specify the fields for them; also the fields solr returns to you in results.
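For the hl.fragsize part of the question, the highlighting options are just extra request parameters next to q. The sketch below assembles such a URL in Python; the function name and the default values are assumptions, while `df`, `fl`, `hl`, `hl.fl` and `hl.fragsize` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url: str, term: str, field: str = "text",
                        fragsize: int = 100) -> str:
    """Build a /select URL that searches and highlights in one field."""
    params = {
        "q": term,
        "df": field,                   # default field used by the query parser
        "fl": field,                   # fields returned in the results
        "hl": "true",                  # switch highlighting on
        "hl.fl": field,                # field(s) to highlight
        "hl.fragsize": str(fragsize),  # approximate snippet size in characters
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_highlight_url("http://localhost:8983/solr", "montagna"))
```

Highlighting only produces snippets for fields that are stored, so the chosen field needs stored="true" in the schema.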
I am running Solr 6.4.1 on a Windows 7 machine, with Chrome for testing query URLs currently.
I have set up and got working an index on a set of test documents - a small number of webpages saved as Docx files in a folder. I can get basic queries working and am now trying to get highlighting working.
I have not modified the schema in any way - simply indexed the folder into a Core called test.
The following query works and highlights as I expect:
http://localhost:8983/solr/test/select?hl=on&hl.fl=meta_author&q=steven&wt=xml&fl=meta_author
and returns
...<lst name="highlighting">
<lst name="C:\Users\steven\Documents\Indexing\Dungeon Arena Building.docx">
<arr name="meta_author">
<str><em>steven</em></str>
</arr>
</lst>...
However, if I change the fields to try to highlight where the term is found in the name of the document, it does not work in this way.
http://localhost:8983/solr/test/select?hl=on&hl.fl=dc_title&q=gothic&wt=xml&fl=dc_title
returns
...<lst name="highlighting">
<lst name="C:\Users\steven\Documents\Indexing\Basic Gothic Dungeon.docx"/>
<lst name="C:\Users\steven\Documents\Indexing\Dungeon Arena Building.docx"/>
</lst>...
The results are correct but it does not highlight the identified data fields.
Are there some rules around the available fields that can be highlighted or do I need to amend something in the schema?
For context I aim to bring over all the file content into the index so that I can then present back the match in context of the surrounding text for the users to see.
Check whether the dc_title field is stored.
In your schema, the field should look like this (the field type can differ depending on what you defined, but set stored=true). After the modification, reindex the documents and search again.
<field name="dc_title" type="text_general" indexed="true" stored="true"/>
I am trying to import into Solr 5.1.0 and 5.2.1 with a data-config that should produce documents with the following structure:
<parentDoc>
<someParentStuff/>
<childDoc>
<someChildStuff/>
</childDoc>
</parentDoc>
From what I understand from one of the answers on this question about nested entities in DIH, my versions of Solr should be able to create the above structure with the following data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
url=""
user=""
password=""
batchSize="-1"
/>
<document name="">
<entity rootEntity="true" name="parent" pk="parent_id" query="select * from parent">
<field column="parent_id" name="parent_id" />
<entity child="true" name="child" query="select * from child where parent_id='${parent.parent_id}'">
<field column="parent_id" name="parent_id" />
<field column="item_status" name="item_status" />
</entity>
</entity>
</document>
</dataConfig>
However, when I perform a full-import, I get:
<result name="response" numFound="2" start="0">
<doc>
<long name="parent_id">477</long> <!-- This is from the child -->
<str name="item_status">WS</str>
</doc>
<doc>
<long name="parent_id">477</long> <!-- This is from the parent -->
</doc>
</result>
which I understand is the denormalized layout you're supposed to get pre-5.1.0; however, I expected:
<result name="response" numFound="1" start="0">
<doc>
<long name="parent_id">477</long>
<doc>
<long name="parent_id">477</long>
<str name="item_status">WS</str>
</doc>
</doc>
</result>
What do I need to do to get my desired document structure? Am I misunderstanding what nested entities in the DIH are supposed to do?
Unless someone swings by to tell me otherwise, it seems I have really misunderstood the creation of parent-child docs in Solr 5.1.0+. I was expecting to be able to nest documents and have them returned, but that's not possible with Solr (at least at this point. The future is a mystery.)
Solr is a flat document model. What this means is it doesn't model parent-child relationships in the way I wanted in my original question. There is no nesting. Everything is flat and denormalized.
What Solr does is it adds n-number of child documents next to their parent in a contiguous block. For example:
childDoc1 childDoc2 childDoc3 parent
This structure is actually reflected in the documents I was "mistakenly" getting returned from Solr:
<result name="response" numFound="2" start="0">
<doc>
<long name="parent_id">477</long> <!-- This is from the child -->
<str name="item_status">WS</str>
</doc>
<doc>
<long name="parent_id">477</long> <!-- This is from the parent -->
</doc>
</result>
The nested document support available in the DIH after Solr 5.0 is actually an add-on or outright replacement for the old way people used to have to index nested documents, and also seems to take care of updating child + parent docs at the same time for you. Very convenient!
So, then, how do you express a parent-child relationship when Solr destroys that nice, nested document model you had planned? You have to get the parent docs and the child docs and manage the relationship in your application. How do you get the parents and children?
The answer is block joins.
Use block joins during query time, and then in your application, process those documents into your desired structure. Let's look at two block join queries because they can look a bit weird at first.
{!parent which='type:parent'}item_id:5918307
This block join query says, "Get me the parent document that has one or more child documents with the item_id of 5918307."
{!child of='type:parent'}
(fieldA:TERM^100.0 OR fieldB:term^100.0 OR fieldC:term OR (fieldD:term^20.0)) AND (instock:true^9999.0)
This block join query says, "Get me one or more child documents whose parent documents contain the word 'term' and are in stock."
Do NOT search on child fields when doing !child queries. So, referencing the first example, you would not search on item_id, because that would give you a 500 error.
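If you build these query strings in application code, a small helper keeps the local-params syntax in one place. This is only a sketch in Python; the function names are hypothetical, and the `type:parent` filter assumes the type field described in this answer.

```python
def parent_query(parent_filter: str, child_clause: str) -> str:
    """Return parents that have at least one child matching child_clause."""
    return f"{{!parent which='{parent_filter}'}}{child_clause}"


def child_query(parent_filter: str, parent_clause: str) -> str:
    """Return children whose parent matches parent_clause.

    parent_clause should only reference parent fields.
    """
    return f"{{!child of='{parent_filter}'}}{parent_clause}"


if __name__ == "__main__":
    print(parent_query("type:parent", "item_id:5918307"))
    print(child_query("type:parent", "fieldC:term AND instock:true"))
```

Remember to percent-encode the resulting string before putting it into the q parameter of a URL.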
You'll notice the type field in these queries. You do have to add this to your schema and data-config yourself. In the schema, it looks like this:
<!-- use this field to differentiate between parent and child docs -->
<field name="type" type="string" indexed="true" stored="false" />
Then in data-config.xml, just do something like the following for the parent:
if ('true' = 'true', 'parent', 'parent') as type
And then do the same for the child, except substitute "child" where you put "parent" before.
So in the end you might end up making two queries, but it doesn't seem like adding the block join parser adds too much to query time. I'm seeing maybe an extra 50 or 100ms per query.
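Once both result sets are back, rebuilding the nested shape in the application is straightforward. The Python sketch below assumes the documents have already been parsed into dictionaries and that both parents and children carry `parent_id`, as in my example; the `children` key and the function name are made up for illustration.

```python
from collections import defaultdict


def nest_children(parents, children, key="parent_id"):
    """Attach each child document to its parent under a 'children' list."""
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child[key]].append(child)
    # Copy each parent so the original result documents are left untouched.
    return [{**parent, "children": by_parent.get(parent[key], [])}
            for parent in parents]


if __name__ == "__main__":
    parents = [{"parent_id": 477}]
    children = [{"parent_id": 477, "item_status": "WS"}]
    print(nest_children(parents, children))
```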
You can also usually bypass needing nested documents by being smart with your joins. What I've discovered, however, is that because the child documents now mingle with the parent documents, you have one "copy" of each parent with specific child information in your index. In this situation, you would grab the known parent fields from the first document, along with its child fields. For the rest of the documents, you would just grab the child fields.
Another option, when you just want the parent doc and don't want a whole bunch of other docs being returned, is to use grouping queries; however, I wouldn't recommend it. When I tried it on a query that returned a large number of results, I saw query times go from a 10ms - 250ms range all the way up to the 500ms - 1s range.
(Still a newbie; more questions)
I'm performing atomic updates on some SOLR 4 records via HTTP GET calls. This is working correctly after I fixed up some problems with my URLs.
But my original problem is still present: After I update a document, my search queries are no longer finding my updated docs.
Do I need to re-index an updated document? Do atomic updates cause a document to fall out of the index?
example:
I can search with this:
http://solrfarm.gateway.cco:8983/solr/records/select/?q=firstName:(tomas) recordType:(myrectype)&rows=100
and I get XML that looks like:
<doc>
<str name="id">CollName-7276748</str>
<str name="system">OHM Liens</str>
<long name="_version_">1464208859225653248</long>
<bool name="optout">false</bool>
</doc>
I want to change the optout value to "true" and that is happening with a URL that looks like this:
http://prodsolr01.cco:8983/solr/records/update?stream.body=%3Cadd%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3ECollName-7276748%3C/field%3E%3Cfield%20name=%22optout%22%20update=%22set%22%20%3Etrue%3C/field%3E%3C/doc%3E%3C/add%3E&commit=true
Decoded and formatted:
stream.body=
<add>
<doc>
<field name="id">CollName-7276748</field>
<field name="optout" update="set" >true</field>
</doc>
</add>
&commit=true
But, now when I run my original query, my record does not get returned.
If I search for the record explicitly, I get the record:
http://solrfarm.gateway.cco:8983/solr/records/select/?q=id:(%22CollName-7276748%22)%20&rows=100
So I'm confused as to why an updated record is no longer found by my query. Do I need to pass in all the original fields to my update command (i.e. the "firstName" and "lastName" fields that were indexed originally)?
Shouldn't it be enough to just perform the update?
Again, I'm a newbie and I'm probably not "getting" something basic, so all help is appreciated.
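For reference, this is roughly how an update URL like the one above could be generated, sketched in Python. The helper name is hypothetical, and it percent-encodes every reserved character in stream.body, which is slightly stricter than my hand-built URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def build_atomic_update_url(base_url: str, doc_id: str,
                            field: str, value: str) -> str:
    """Build an update URL whose stream.body sets one field on one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    # update="set" marks this field as an atomic "set" operation.
    ET.SubElement(doc, "field", name=field, update="set").text = value
    body = ET.tostring(add, encoding="unicode")
    return f"{base_url}/update?stream.body={quote(body, safe='')}&commit=true"


if __name__ == "__main__":
    print(build_atomic_update_url("http://prodsolr01.cco:8983/solr/records",
                                  "CollName-7276748", "optout", "true"))
```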
I'm trying to understand Solr nested queries, but I'm having a problem understanding the syntax.
I have the following two indexed documents (among others):
<doc>
<str name="city">Guarulhos</str>
<str name="name">Fulano Silva</str>
</doc>
<doc>
<str name="city">Fortaleza</str>
<str name="name">Fulano Cardoso Silva</str>
</doc>
If I query for q="Fulano Silva"~2&defType=edismax&qf=name&fl=score I have:
<doc>
<float name="score">28.038431</float>
<str name="city">Guarulhos</str>
<str name="name">Fulano Silva</str>
</doc>
<doc>
<float name="score">19.826164</float>
<str name="city">Fortaleza</str>
<str name="name">Fulano Cardoso Silva</str>
</doc>
So I thought that if I queried for:
q="Fulano Silva"~2 AND __query__="{!edismax qf=city}fortaleza" &defType=edismax&qf=name&fl=score
I'd give a bit more score for the second document, but actually I get an empty result set with numFound=0.
What am I doing wrong here?
Need to remove the "=" and replace it with ":" to use the nested query syntax:
q="Fulano Silva"~2 AND _query_:"{!edismax qf=city}fortaleza" &defType=edismax&qf=name&fl=score
*Use _query_: instead of _query_=
Hope this works...
EDIT: When you say q=, are you specifying the query in a URL, or is the text after the q= being put in an application or the Solr dashboard? If we're talking about a URL, you may need to use percent-encoding to get it to work. I mentioned that below, but since I haven't heard from you, I thought I'd reiterate.
Why don't you do q=name:"Fulano Silva" AND city:"fortaleza"?
Another possibility: q=_query_:"{!edismax qf='name'}Fulano Silva" AND city:"fortaleza"
If you're set on a nested query, select?defType=edismax&q="Fulano Silva" AND _query_:"{!edismax qf='city' v='fortaleza'}" should work, but the results and the way it matches will depend on what analyzers you are using to query and index name and city. Also, if these queries are in your query string, make sure you are encoding them properly.
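As an illustration of the encoding point, here is a small Python sketch that builds the nested query and lets the standard library do the percent-encoding. The function name is hypothetical, and the field names `name` and `city` are taken from the question.

```python
from urllib.parse import urlencode


def build_nested_query_url(base_url: str, phrase: str, city: str) -> str:
    """Build a /select URL combining a phrase query with a nested _query_."""
    q = f'"{phrase}"~2 AND _query_:"{{!edismax qf=city}}{city}"'
    params = {"q": q, "defType": "edismax", "qf": "name", "fl": "score"}
    # urlencode escapes the quotes, braces, colon and spaces inside q.
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_nested_query_url("http://localhost:8983/solr",
                                 "Fulano Silva", "fortaleza"))
```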
In order to help you any more, I need to know what you're trying to accomplish with your query. Then perhaps we can be sure you have the right indexing set up, that edismax is the right query parser, etc.
On top of the previous comments, the asker has misspelled _query_ as __query__ (note the double underscore in the second, misspelled, version); Solr expects _query_ to be spelled with only one underscore (_) before and one after the word query, not two.
I have a text-field called name in my schema.xml. A query q=name:(organic) returns the following documents:
<doc>
<str name="id">ontology.category.1483</str>
<str name="name">Organic Products</str>
</doc>
<doc>
<str name="id">ontology.keyword.4896</str>
<str name="name">Organic Stores</str>
</doc>
This is perfectly right in a normal Solr Search, however I would like to construct the query so that it doesn't return anything because 'organic' only matches 1 of the 2 words available in the field.
A better way to say it could be this: Only return results if all tokens in the field are matched. So if there are two words (tokens) in a field and I only match 1 ('organic', 'organics','organ' etc.) I shouldn't get a match because only 50% of the field has been searched on.
Is this possible in Solr? How do I construct the query?
You are probably using StandardTokenizerFactory (or something similar). One solution is to use KeywordTokenizerFactory and issue a phrase query; then only perfect matches will work. Remember any other filters you might want to use (like LowerCaseFilterFactory). Note that "stores organic" will not match your doc either.
Due to time constraints, I had to resort to the following (hacky) solution.
I added the term count to the index via a DynamicField field called tc_i.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
Now at query time I count the terms and append it to the query, so q=name:(organic) becomes q=name:(organic) AND tc_i:(1) and this won't return documents for "organic stores" / "organic products" obviously because their tc_i fields are set at 2 (two words).
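The query-time half of this hack looks roughly like the Python sketch below. The function names are made up, and counting terms by splitting on whitespace is an assumption; it has to agree with however tc_i was computed at index time.

```python
def term_count(text: str) -> int:
    """Count whitespace-separated terms (assumed to match the indexed tc_i)."""
    return len(text.split())


def build_exact_length_query(field: str, user_input: str) -> str:
    """Append a tc_i clause so only fields with the same term count match."""
    return f"{field}:({user_input}) AND tc_i:({term_count(user_input)})"


if __name__ == "__main__":
    print(build_exact_length_query("name", "organic"))
```

With this, "organic" only matches documents whose name field holds a single term, while "organic stores" is restricted to documents with tc_i set to 2.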