I am wondering what the difference is between the content field and the _text_ field. I had an issue where I indexed all of my documents/PDFs, but for some reason I could not access the actual text in those documents. I noticed I had no "content" field, so I created one and am currently reindexing. However, I noticed there is a _text_ field that has stored=false. Do both of these fields take all the text from the documents/PDFs?
The _text_ is a field defined by default on a new Solr core (see https://lucene.apache.org/solr/guide/7_5/schemaless-mode.html).
The default managed-schema file in a new Solr core does not indicate that this field is populated automatically, so I suspect it's up to you to populate it.
The _text_ field can be used to dump a copy of all the text in the document, but this is something that you have to do yourself (either by manually populating the _text_ field or by using copyFields).
The fact that _text_ is indexed but not stored means that you can search for text inside of it (because it's indexed) but you cannot fetch and display its value to the user (because it is not stored).
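As a minimal sketch (assuming you want every field's content searchable in one place), a catch-all copyField rule like the following would populate _text_ at index time:

```xml
<!-- Copy the content of every field into _text_, so queries that
     don't name a field (e.g. with df=_text_) can still match it.
     _text_ remains indexed="true" stored="false": searchable,
     but its value cannot be returned in results. -->
<copyField source="*" dest="_text_"/>
```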
Related
Apache Solr: searching multiple fields without specifying a field name, in Solr 7.7.2. I created a copy field for all fields, assigning them to dest="text", which is a field of text type, but it doesn't return any output. It works only for a single field, where df=fieldName.
The core uses a managed schema, which automatically overrides my changes after indexing. Please let me know what the issue could be.
When I add any field in Solr and then index some data, Solr creates a copy field for this field.
For example I added a field named app_id and after indexing there are data both in app_id and another field named app_id_str.
Is there any way to prevent these copy fields from being created?
I am assuming you are using a reasonably new Solr version. You can prevent Solr from automatically creating copy fields at index time: you just have to configure the "add-schema-fields" update processor not to create copy fields on the fly. Here is how:
Open the solrconfig.xml file of the core you wish to disable adding copy fields automatically.
Comment out the configuration that creates copy fields for text fields (or for any other field type that is configured to generate a copy field).
Save and restart the Solr instance.
Index the documents.
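For reference, in a recent default solrconfig.xml the part to comment out sits inside the "add-schema-fields" update processor's typeMapping. The exact contents vary by version, but it looks roughly like this:

```xml
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
  <lst name="typeMapping">
    <str name="valueClass">java.lang.String</str>
    <str name="fieldType">text_general</str>
    <!-- Comment out this block to stop Solr from generating *_str
         copy fields (such as app_id_str) for every new string field. -->
    <!--
    <lst name="copyField">
      <str name="dest">*_str</str>
      <int name="maxChars">256</int>
    </lst>
    -->
    <bool name="default">true</bool>
  </lst>
</updateProcessor>
```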
Schema.xml
Search for copyField definitions using wildcards in their glob pattern in schema.xml.
The copyField command can use a wildcard (*) character in the dest parameter only if the source parameter contains one as well. copyField uses the matching glob from the source field for the dest field name into which the source content is copied.
You need to comment out anything that looks like this:
<copyField source="*" dest="*_str"/>
You may also have some dynamicField definitions like the following that would create the copied fields (otherwise you would probably remember having explicitly defined a field like app_id_str):
<dynamicField name="*_str" type="string"/>
SchemaLess Mode
Internally, the Schema API and the Schemaless Update Processors both use the same Managed Schema functionality.
If you are using Solr in "schemaless mode", you can do the same either by using the Schema API:
Delete a Copy Field Rule
Delete a Dynamic Field Rule
Or by reconfiguring the dedicated update processor in solrconfig.xml as stated by Kusal.
See the paragraph titled You Can Still Be Explicit below this section.
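A sketch of those two Schema API commands, assuming a collection named mycollection and the default *_str pattern, POSTed to /solr/mycollection/schema:

```json
{
  "delete-copy-field":    { "source": "*", "dest": "*_str" },
  "delete-dynamic-field": { "name": "*_str" }
}
```

The delete-copy-field command must match an existing rule's source and dest exactly, so check your schema for the actual pattern before sending it.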
I am having an issue with partial updates in Solr. Since my collection has some non-stored fields, the values in those fields are gone after a partial update. So, is it possible to use a copy field to copy the original content for the non-stored field from a different collection?
No. copyFields are invoked when a document is submitted for indexing, so I'm not sure how that would work semantically either. In practice, a copyField instruction duplicates a field value when the document arrives at the server, copying it into fields with other names. That doesn't make sense if a different collection is involved: would it be invoked when documents are submitted to the other collection, and if so, what would happen to the fields local to the actual collection?
Set the fields to stored if you want to use partial updates with fields that can't support in-place updates (in-place updates have very particular requirements: the field must be non-stored, non-indexed, single-valued, and have numeric docValues).
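For contrast, a field that does qualify for in-place updates would be defined along these lines (the field name here is hypothetical):

```xml
<!-- Eligible for in-place updates: single-valued numeric field
     with docValues, neither indexed nor stored. -->
<field name="popularity" type="pint"
       indexed="false" stored="false" docValues="true"/>
```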
I posted 3 documents with post.jar and they were indexed successfully; searching for any word from those documents returned the correct document. But when I partially update a document (just updating one field), a subsequent search for a word no longer succeeds. That is, after the partial update the contents of the document are lost. The fields I updated were defined by me manually, i.e. outside of the fields that post.jar creates by itself.
So what is the solution, so that the content stays the same after a partial update?
Assuming by "partial update" you are talking about the Atomic Update feature, then this will apply:
In order for Atomic Update to not lose data, all fields in your schema that are not copyField destinations must have stored="true". All fields that ARE copyField destinations must have stored="false".
Further details required for proper Atomic Update operation: The information in copyField destinations must only originate from copyField sources. If some information in copyField destinations originates from the indexing source and some of it comes from copyField, then the information that originated from indexing will be lost when Atomic Update is used.
Also see the "Field Storage" section found on this page from the Solr documentation:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents#UpdatingPartsofDocuments-AtomicUpdates
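For illustration, an atomic update request (the id and field values here are hypothetical) sent to /solr/&lt;collection&gt;/update looks like this; Solr rebuilds the whole document internally from its stored fields, which is why data in non-stored, non-copyField-destination fields is lost:

```json
[
  {
    "id": "doc1",
    "app_id": { "set": "new-value" }
  }
]
```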
I solved my problem by setting stored=false on all dynamic fields and removing the copy field for text.
As all fields are copied into the text field, after making these changes my problem was solved.
I searched the documentation and cannot find where it stores all the data.
I want to access all the crawled data in order to do my own processing.
The index directories are set up in the file StartStopListener: look for the values of the environment variables OPENSEARCHSERVER_DATA, OPENSEARCHSERVER_MULTIDATA, or OPENSHIFT_DATA_DIR.
Now, whether you'll be able to parse the files easily and correctly is another matter: I have never tried to open a search server's indexes directly by hand, and I don't know whether the index format is well documented.
By default, the crawled data are not stored. Only the extracted text is stored. It is possible to store the crawled data, here is the process:
Create a new field: Set the "stored" parameter to yes or to compressed.
Go to the Schema / Parser List
Edit the HTML parser
In the "Field Mapping" tab, link the parser field "htmlSource" to the new field.
Restart the indexing process. Now all crawled data will be copied to this field. Don't forget to add it as a returned field in your query.