Getting date metadata using SolrCell - solr

I'm using Solr 3.6 to index many different types of documents. I have several fields that define common information for all the documents, one of them being 'date' (ideally last modified date, just something to indicate how recent a document is.)
<field name="date" type="date" indexed="true" stored="true" required="true" />
My problem arises when trying to index rich text documents like .docx and .pdf. I want to fill in the date field using metadata that I get from the ExtractingRequestHandler, but the name of the metadata field that holds the date I want differs from file to file. Sometimes the field I want is 'date'; other times it's 'last_modified' or 'last_save_date'. I was trying to use 'last_modified' to provide the date in the handler:
<str name="fmap.last_modified">date</str>
...but this led to problems where date was either multivalued (since there was also 'date' metadata) or undefined (because 'last_modified' didn't exist). I looked into using conditional copyFields to try to extract data from at least one of these fields, but that seems complicated (i.e. extending the update handler) and would also require that I know the name of every possible field that could contain this date information.
Is there any way that I can reliably extract a date from every rich-text document that I process?
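One way to approach this, sketched here rather than taken from the thread, is to make the choice on the client: fetch the metadata first (for instance with the handler's extractOnly option), pick the first usable date, and send it along as a literal field value. The list of candidate keys and the rule that timezone-less values are treated as UTC are assumptions for illustration.

```python
from datetime import datetime, timezone

# Candidate metadata keys in order of preference. These three come from the
# question; the list is an assumption and may need extending for other formats.
CANDIDATE_KEYS = ["last_modified", "last_save_date", "date"]


def pick_date(metadata, fallback=None):
    """Return the first parseable date in Solr's UTC format, else the fallback."""
    for key in CANDIDATE_KEYS:
        value = metadata.get(key)
        if isinstance(value, list):
            # Multi-valued metadata: use the first value only.
            value = value[0] if value else None
        if not value:
            continue
        try:
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            continue  # not an ISO 8601 date, try the next key
        if parsed.tzinfo is None:
            # Assumption: values without a timezone are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return fallback
```

Because exactly one value is chosen, the required, single-valued date field is neither left empty nor filled twice, provided a sensible fallback (such as the file's modification time) is supplied.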

Related

Apache solr date field in views

I have a custom date field, field_last_archived_date, in one of my content types.
There is a corresponding entry in the Apache Solr field list called dm_field_last_archived_date.
I am facing two problems:
When I try to sort a Solr view by this field, I get the error "cannot sort on multivalued field."
When I try to use this field as an exposed filter to provide a date range, I'm not sure what date format should be given. I have tried formats like "2011-10-01T23:59:59Z", "2011-10-01 23:59:59", a plain Unix timestamp, etc., but all of them throw the error "Invalid Date String:'OctoberAMCECESTAM+02:001_SunAMCESTE_1nd+02008601'".
Any idea what I am doing wrong here?
Thanks...
dm_field_last_archived_date is a multivalued field, and Solr does not support sorting on multivalued fields.
To confirm this behavior, try sorting on a single-valued field.
You can check whether a field is multivalued in Solr's schema file; a multivalued field definition looks like this:
<field name="yourFieldName" type="tint" indexed="true" stored="true" omitNorms="true" multiValued="true" default="defaultValue"/>
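On the date-range part of the question: Solr itself expects dates in the form 2011-10-01T23:59:59Z (UTC, with a trailing Z), and a range is written as field:[start TO end]. The sketch below only builds such a filter string. The helper names are hypothetical, and how the view's exposed filter passes the value through is not covered in this thread.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format a timezone-aware datetime the way Solr date fields expect it."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_range_filter(field, start, end):
    """Build an inclusive Solr range filter such as field:[A TO B]."""
    return f"{field}:[{solr_date(start)} TO {solr_date(end)}]"


# Example: everything archived on 1 October 2011 (UTC).
start = datetime(2011, 10, 1, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2011, 10, 1, 23, 59, 59, tzinfo=timezone.utc)
day_filter = date_range_filter("dm_field_last_archived_date", start, end)
```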

Solrj indexing mechanism

I have a question about the indexing mechanism when using Solr from Java. If I create documents and only want to search on the field "name", will Solr index all fields, or only the field "name" in each document?
If you tell Solr to only store the field name in your schema, then only the field name will be stored.
If you instruct Solr to store everything you send to it (like in the schemaless mode) and you send 400 fields, each of those fields will be stored.
If you want to store information but not search for it, only those fields which you are going to query need to be indexed, while the other fields can be limited to just stored. If you don't need the content of the field, but just want to search for it, you can set stored to false, and indexed to true.
In the schema.xml where you define the fields getting used, you need to mention indexed=true for all the fields you want to search on.
In your case it would look something like this -
<field name="name" type="string" indexed="true" stored="true" />

Solr index vs stored

I am a little confused about the behaviour of the indexed and stored attributes of Solr fields.
For example if I have the following in the Schema.xml
<field name="test1" type="text" indexed="false"
stored="false" required="false" />
Will the field test1 not be stored in the Solr document even if I create a document with that field, set a value for it, and commit the document to Solr? Since I have the stored=false attribute, does it mean that the value of the field is lost in Solr and not persisted?
That is correct. Typically you will want your field to be either indexed or stored or both. If you set both to false, that field will not be available in your Solr docs (either for searching or for displaying). See Alexandre's answer for the special cases when you will want to set both to false.
indexed=true makes a field searchable (and sortable and facetable). For example, if you have a field named test1 with indexed=true, then you can search it like q=test1:foo, where foo is the value you are searching for. If indexed=false for field test1, then that query will return no results, even if you have a document in Solr with test1's value being foo.
stored=true means you can retrieve the field when you search. If you want to explicitly retrieve the value of a field in your query, you will use the fl param in your query like fl=test1 (Default is fl=* meaning retrieve all stored fields). Only if stored=true for test1, the value will be returned. Else it will not be returned.
The main point of having both set to false is to explicitly skip that particular field.
For example, if you have a storing/indexing dynamicField mapping and you want to ignore one particular name that would otherwise fall under dynamicField's pattern.
Alternatively you could use dynamicField to ignore a whole set of fields with same prefix/suffix that comes from a 3rd party. For example, Tika will send you a whole bunch of metadata fields which you may just want to ignore. See this defined in Solr's example schema.xml and used in solrconfig.xml
In the later versions of Solr, you could also use IgnoreFieldUpdateProcessorFactory (see full list for others) instead, which will get rid of those fields even earlier in the indexing process.
Quoting from this response in the Solr's mail thread:
"indexed" and "stored" are independent, orthogonal attributes - you can use
any of the four combinations of true and false. "indexed" is used for search
or query, the "lookup" portion of processing a query request. Once the
search/query/lookup is complete and a set of documents is selected, "stored"
is the set of fields whose values are available for display or return with
the Solr response.
Part of the reason for the separation is that Solr/Lucene "analyzes" or
transforms the input data into a more efficient form for faster and more
relevant search/lookup. Unfortunately, that analyzed/transformed data is
frequently no longer suitable for display and human consumption. In other
words the analysis/transformation is not bidirectional/reversible. Setting
"stored=true" guarantees that the original data can be retrieved in its
original form.
If both are false, you lose the data in that field. If only indexed is true, the data is searchable but cannot be displayed. If only stored is true, you will not be able to search on that field, but it can be displayed (in this case you can write a copyField rule to copy the data from that field to the default searchable field). If both are true, you can search and display.
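The four combinations can be illustrated with a toy in-memory model. This is only a sketch of the semantics described in these answers, not of how Solr or Lucene are implemented, and the ToyIndex class is hypothetical. Searching an unindexed field raises an error here, as one answer in this thread reports; another answer describes an empty result instead.

```python
class ToyIndex:
    """Toy model of indexed/stored semantics; not Solr's implementation."""

    def __init__(self, schema):
        # schema maps field name -> (indexed, stored)
        self.schema = schema
        self.inverted = {}  # (field, value) -> list of document positions
        self.stored = []    # per-document dict holding stored fields only

    def add(self, doc):
        pos = len(self.stored)
        kept = {}
        for field, value in doc.items():
            indexed, stored = self.schema[field]
            if indexed:
                self.inverted.setdefault((field, value), []).append(pos)
            if stored:
                kept[field] = value
            # indexed=False and stored=False: the value is simply dropped.
        self.stored.append(kept)

    def search(self, field, value):
        if not self.schema[field][0]:
            raise ValueError(f"field '{field}' is not indexed")
        return [self.stored[pos] for pos in self.inverted.get((field, value), [])]


index = ToyIndex({
    "id": (True, True),      # searchable and returned
    "item": (True, False),   # searchable, never returned
    "note": (False, True),   # returned, not searchable
    "junk": (False, False),  # ignored entirely
})
index.add({"id": "1", "item": "Tennis", "note": "hello", "junk": "x"})
hits = index.search("item", "Tennis")
```

Here the search on item finds the document, but the result contains only id and note, the two stored fields.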
indexed = true means that this field can be used in the search.
For example, if I define the item field as follows and then try to use it in a search:
<field name="item" type="text_general" uninvertible="true" indexed="false" stored="true"/>
fq=item:"Tennis" will raise an error.
stored = true means that this field can be retrieved in the list of fields displayed after a query.
For example, if the item field is defined as follows
<field name="item" type="text_general" uninvertible="true" indexed="true" stored="false"/>
You will be able to search with fq=item:"Tennis" correctly, but the item field will not be returned in the results.
Regards

Know indexing time for a document in Solr

Is it possible to know the indexing time of a document in Solr? Just as there is an implicit "score" field that automatically gets added to a document, is there a field that stores the indexing time?
I need it to know the date when a document got indexed.
Thanks
Solr does not automatically add a creation date to documents. You could certainly index one with the document, though, using Solr's DateField. In earlier versions of Solr (< 4.2), there was a commented-out timestamp field in the example schema.xml, which looked like:
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
Also, I think it bears noting that there is no implicit "score" field. Scores are calculated at query time, rather than being tied to the document. Different queries will generate different scores for the same document. There are norms stored with the document that are factored into scores, but they aren't really fields.
femtoRgon gave you a correct solution, but you must be careful with partial document updates.
If you do not do partial document updates, you can stop reading now ;-)
If you partially update a document, Solr will merge the existing values with your partial document, and the timestamp will not be updated. The solution is to not store the timestamp, so that Solr cannot merge this value. The drawback is that you cannot retrieve the timestamp with your search results.
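Even when the timestamp is indexed but not stored, it can still be used for filtering. As a rough sketch, the request parameters for "documents indexed in the last day" could be built like this. The field name timestamp comes from the schema line above; the function name and the default window are assumptions for illustration.

```python
from urllib.parse import urlencode


def recent_docs_params(field="timestamp", since="NOW-1DAY"):
    """URL-encode a match-all query filtered to recently indexed documents."""
    params = {
        "q": "*:*",                         # match all documents
        "fq": f"{field}:[{since} TO NOW]",  # Solr date math range
    }
    return urlencode(params)


query_string = recent_docs_params()
```

The resulting string would be appended to the select handler's URL after a question mark.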

Solr Search not working after dataimport successful

I am new to Solr. I have tried DataImport using an Oracle database. The data gets imported successfully. When I try to search with the query:
qt=standard
q=*
I get good results. But when I do a specific search, the results are empty showing no documents. The logger is empty and there are NO errors displayed.
Ok! I got it.
I observed that when I use some of the pre-defined fields of schema.xml, searches on those fields work fine. But when I defined some fields of my own, the result was still NOTHING.
Then I looked at the "/select" request handler in solrconfig.xml. There is a line
<str name="df">text</str>
which I took to mean that "text" is the only field that is searchable. But then how does it search the other fields?
The answer lies in the <copyField> tag in schema.xml. The fields present by default are copied into "text", which makes them searchable. Hence, if you want your own field to be searchable, just define your field and add it in a copyField tag. ;)
TLDR Version: Define your fields as type="text" to start off. If you have a field called "product", add <field name="product" type="text" indexed="true" stored="true" /> to the default schema.xml inside the <fields> tag and you should be done. To search using the select request-handler, use q=<field_name>:<text_to_look_for> or q=*:* to show all documents.
There are a few mistakes you're making here. I'll be explaining using the 'select' request handler.
The format for a query is ?q=<field_name>:<text_to_look_for>. So if you want to return all the values matching all the fields, you'd say q=*:*
And if you were to look for the word "iPod" in the field "product" your query would be q=product:iPod
Another thing to keep in mind: if schema.xml specifies the field product as type="string", which maps to class="solr.StrField", the query (<text_to_look_for>) must match the value in the index exactly, since Solr doesn't tokenize a StrField by default, i.e., ipod will not return results if your index holds it as iPod. If you need it to match anyway, you could use type="text" in schema.xml (the fieldType definition is already present in the default schema.xml). The analysis chain of the "text" fieldType includes a tokenizer, which splits the field value into words and indexes them, and several filters, one of which ignores case, so that a search for a particular word, say "ipod", would match the value "iPod 16GB White".
Regarding your own answer, <str name="df">text</str> specifies the default field to search in, i.e., if you just said q=iPod, it would look in this field. The purpose of the field called text is to hold the content of all the other fields in the document, so that you can search in this one field and know that a match in any of the fields will be found, without needing to name a specific field when you don't know which one holds the value.
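The difference between the "string" and "text" types described above can be illustrated with a toy matcher. This is a deliberately simplified sketch: real analysis chains do much more than whitespace splitting and lowercasing, and the function names are made up for the example.

```python
def matches_string_field(indexed_value, query):
    """StrField-like behaviour: the whole value must match exactly."""
    return indexed_value == query


def analyze(text):
    """Very rough stand-in for a text analysis chain: split, then lowercase."""
    return [token.lower() for token in text.split()]


def matches_text_field(indexed_value, query_term):
    """Text-like behaviour: a single query term matches any indexed token."""
    return query_term.lower() in analyze(indexed_value)


value = "iPod 16GB White"
string_hit = matches_string_field(value, "ipod")  # whole value differs
text_hit = matches_text_field(value, "ipod")      # token "ipod" is present
```

With the string-style comparison the lowercase query misses the document, while the tokenized, lowercased comparison finds it.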

Resources