Solr: Retrieve the name of the document where the word is found

I am using queries (Solr Admin) to search for words in two text documents that are in my HDFS. How can I retrieve the name of the document in which the word is found? I am using this project: https://github.com/lucidworks/hadoop-solr
I am creating a collection using bin/solr -e cloud, and I am using "data_driven_schema_configs" from the server/solr/configsets/ directory.
I tried adding <field name="fileName" type="string" indexed="true" stored="true" /> inside managed-schema at ~/solr-6.1.0/server/solr/configsets/data_driven_schema_configs/conf, and also renamed the file to schema.xml, but this directory has no dataConfig file in which to add <field column="file" name="fileName"/>, as I have seen in other posts with similar questions. Those posts are not about SolrCloud, so I don't know whether what I am trying is correct. What changes do I have to make, and in which directories, to make this work?
Example: I am searching for the word "greatest", which can be found in both documents. How can I see which document each result comes from, sample1.txt or sample2.txt?

Same thing I said when you mentioned this question on IRC:
Your Solr schema must contain a field where you put the name, set to stored="true", and you must include that field, with a relevant value, in every document when you index. Most schema changes require a full reindex.
https://wiki.apache.org/solr/HowToReindex
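To make the advice above concrete, here is a minimal sketch of what "include that field in every document" could look like when documents are sent to Solr as JSON. It is independent of the hadoop-solr job from the question. The `content` field name, the example paths and the update URL in the comment are assumptions for illustration; only `fileName` comes from the question.

```python
import json
from pathlib import PurePosixPath


def build_docs(files):
    """Turn {path: text} into Solr documents that carry the file name."""
    docs = []
    for path, text in files.items():
        docs.append({
            "id": path,                            # unique key: the full path
            "fileName": PurePosixPath(path).name,  # stored field from the question
            "content": text,                       # hypothetical full-text field
        })
    return docs


# Hypothetical HDFS paths and contents, for illustration only.
files = {
    "/user/demo/sample1.txt": "the greatest show",
    "/user/demo/sample2.txt": "greatest hits",
}
payload = json.dumps(build_docs(files))
# Assumed next step: POST `payload` with Content-Type application/json to
# http://localhost:8983/solr/<collection>/update?commit=true
print(payload)
```

Once the documents are indexed this way, adding fl=id,fileName to the query should return the file name next to each hit.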

Related

Additional fields in schema.xml are not showing up when I do a query

Solr version information: 6.6.0
The core is named: solr
Instance: /var/solr/data/new_core
In the /var/solr/data/new_core/conf/ directory I have a custom schema.xml file
I have multiple custom fields like this in the schema.xml file
<field name="nid" type="int" indexed="true" stored="true"/>
When I select the 'solr' core and go to query, these custom fields are not showing up in the results. Here's an example of the results:
{
"response":{"numFound":200,"start":0,"docs":[
{
"id":"koe1eh/node/49",
"site":"https://example.com:1881/",
"hash":"koe1eh",
"ss_language":"und",
"url":"https://example.com:1881/node/49",
"ss_name":"tfadmin",
"tos_name":"tfadmin",
"ss_name_formatted":"tfadmin",
"tos_name_formatted":"tfadmin",
"is_uid":1,
"bs_status":true,
"bs_sticky":false,
"bs_promote":false,
"is_tnid":0,
"bs_translate":false,
"ds_created":"2009-03-12T17:46:06Z",
"ds_changed":"2009-06-18T15:25:33Z",
"ds_last_comment_or_change":"2009-06-18T15:25:33Z",
"tos_content_extra":" (Gifts) ",
"sm_field_apptype":["mousepad"],
"_version_":1588589404094464000,
"timestamp":"2018-01-03T16:28:34Z"}]
}}
The query performed is: http://example.com/solr/solr/select?indent=on&q=*:*&rows=1&wt=json
solrconfig.xml is here: https://pastebin.com/iVhZCqTW
schema.xml is here: https://pastebin.com/UBaUN5EK
I have tried restarting solr, reloading the core, and reindexing with no effect.
It turns out that I was just looking at some entries that did not have those fields. When I altered my query to start on record 900, then I saw the fields I was looking for. I'm not sure what else I may have done to get this working as I've been trying many different things.
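Rather than paging until a document with the field turns up, a range query of the form field:[* TO *] matches only documents that have a value in that field. A small sketch that builds such a request; the base URL is the one from the question, and the helper name is made up for illustration.

```python
from urllib.parse import urlencode


def has_field_params(field, rows=10):
    """Query parameters that match only documents with a value in `field`."""
    return {
        "q": f"{field}:[* TO *]",  # open-ended range: any value present
        "fl": f"id,{field}",       # return just the id and the field itself
        "rows": rows,
        "wt": "json",
    }


params = has_field_params("nid", rows=1)
print("http://example.com/solr/solr/select?" + urlencode(params))
```

If this returns numFound=0, the field really is missing from the index; otherwise it is only missing from the documents you happened to look at.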

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

I am new to Solr and I need to implement a full-text search of some PDF files. The indexing part works out of the box by using bin/post. I can see search results in the admin UI given some queries, though without the matched texts and the context.
Now I am reading this post for the highlighting part. It is for an older version of Solr, from before the managed schema was available. Before I fully understand what it is doing, I have some questions:
He defined two fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
But why are there two fields needed? Can I define a field
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
to capture the full text?
How are the fields filled? I don't see relevant information in TikaEntityProcessor's documentation. The current text extractor should already be Tika (I can see
"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
in the returned JSON of some query). But even if I define the fields as he said, I cannot see them in the search results as keys in the JSON.
The _text_ field seems to be a concatenation of other fields. Does it contain the full text? It does not seem to be accessible by default, though.
To be brief, using The Elements of Statistical Learning as an example, how can I highlight the relevant text for the query "SVM"? And if I change the file name to "The Elements of Statistical Learning - Trevor Hastie.pdf" and post it, how can I highlight "Trevor Hastie" for the query "id:Trevor Hastie"?
Before I get to the questions, let me briefly explain how Solr works. At its core Solr uses Lucene, which, simply put, is a matching engine. It builds an inverted index from the terms in your documents: for each term it keeps a list of the documents that contain it, which is what makes lookups so fast. Now to your questions:
Solr itself does not convert your PDF to text. The update processor configured in the request handler does that; you can configure it in solrconfig.xml or write your own handler.
Coming back to why there are two fields: simply put, the first one (content) is a stored field that keeps the data as it is. The second one (text) is a copyField destination, which receives data copied from other fields of each document according to the configuration in schema.xml.
We do this because we can then choose the indexing strategy for the text field. For example, we can add a lowercase filter factory so that everything is indexed in lower case; searching for "Sam" and "sam" then returns the same results. We can also remove common words such as "a" and "the", which would otherwise increase the index size unnecessarily. A larger index uses more memory, so when you are dealing with millions of records you want to be careful about which fields you index.
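To illustrate the idea (this is a toy model in Python, not how Lucene is actually implemented), here is an inverted index with a lowercase step and a stop-word filter. The stop-word list and the sample documents are made up.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative list


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {1: "Sam reads the paper", 2: "a paper for sam"}
index = build_index(docs)
print(sorted(index["sam"]))  # [1, 2]: "Sam" and "sam" land on the same term
print("the" in index)        # False: stop words never reach the index
```

Note that the index holds only the processed terms. The original text ("Sam reads the paper") is gone unless it is kept separately, which is the role of a stored field.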
The field "text" is a copyField destination: data from the source fields listed in the schema is copied into it. When searching in general, you then do not need to fire a separate query for each field, because everything has been copied into "text". This is why it is multiValued: it can store an array of values. content is stored but not indexed, and text is the opposite, because what you return to the end user should be what you saved, not the stripped-down data produced by the filters applied to the text field (stop-word removal, case folding, stemming and so on).
This is the reason you do not see the "text" field in the search results: it is not stored, and Solr uses it only for matching.
For highlighting see this.
For more, these are some great blogs: yonik and joel.
Hope this helps. :)
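As a rough sketch of the highlighting request itself: the standard parameters are hl, hl.fl, hl.snippets and hl.fragsize. The example below assumes a single `content` field that is both indexed and stored, as proposed in the question; the helper name and default values are mine.

```python
from urllib.parse import urlencode


def highlight_params(query, field="content", snippets=3, fragsize=150):
    """Parameters for a search that also returns highlighted snippets."""
    return {
        "q": f"{field}:{query}",
        "hl": "true",             # switch highlighting on
        "hl.fl": field,           # field(s) to take snippets from
        "hl.snippets": snippets,  # max snippets per document
        "hl.fragsize": fragsize,  # approximate snippet length in characters
        "fl": "id",
        "wt": "json",
    }


params = highlight_params("SVM")
print("/select?" + urlencode(params))
```

The snippets come back in a separate "highlighting" section of the response, keyed by document id, rather than inside the docs list.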

Solr click scoring implementation

After searching and searching over the net, I've found a possible open-source solution for click-count popularity in Solr (that is, one that does not require a paid version of LucidWorks Search).
In my next two answers I will try to solve the problem first in an easy way and then in a slightly more complex way.
But first some prerequisites.
We assume a Google-like scenario:
1. the user enters some terms in a text field and pushes the search button
2. the system (a custom web app coupled with Solr) produces a web page with clickable results
3. the user selects one of the results (e.g. to access its details), which tells the system to change the 'popularity' of the selected result
The very easy way.
We define a field called 'popularity' in solr schema.xml
<field name="popularity" type="long" indexed="true" stored="true"/>
Suppose the user clicks on the document with id 1234. We (that is, the web app) then have to call Solr to update the popularity field of the document with id 1234, using the URL
http://mysolrappserver/solr/update?commit=true
and posting in the body
<add>
<doc>
<field name="id">1234</field>
<field name="popularity" update="inc">1</field>
</doc>
</add>
So, each time the web app queries Solr (combining or ordering by the Solr 'boost' field together with our custom 'popularity' field), we obtain a list that is also ordered by popularity.
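The same atomic increment can also be sent as JSON instead of XML. A minimal sketch that builds the request body; the helper name is hypothetical, and the URL in the comment is the one used in this answer.

```python
import json


def popularity_increment(doc_id, amount=1):
    """JSON body for an atomic update that adds `amount` to popularity."""
    return [{"id": doc_id, "popularity": {"inc": amount}}]


body = json.dumps(popularity_increment("1234"))
# POST `body` with Content-Type application/json to
# http://mysolrappserver/solr/update?commit=true
print(body)  # [{"id": "1234", "popularity": {"inc": 1}}]
```

As far as I know, atomic updates need the update log enabled and the other fields of the document to be stored (or have docValues), because Solr rebuilds the whole document internally.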
The more complex idea is to update the Solr index by recording not only the user's selection but also the search terms used to obtain the list.
First of all we have to define a history field in which to store the search terms used:
<field name="searchHistory" type="text_general" stored="true" indexed="true" multiValued="true"/>
Then suppose the user searched for 'something' and selected the document with id 1234 from the result list. The web app will call the Solr instance at the URL
http://mysolrappserver/solr/update?commit=true
adding a new value to the field searchHistory
<add>
<doc>
<field name="id">1234</field>
<field name="searchHistory" update="add">something</field>
</doc>
</add>
Finally, using the Solr termfreq function in every subsequent query, we obtain a 'score' that, combined with the 'boost' field, can produce a list sorted by click-count popularity (and by the history of search terms).
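One way this could be wired up (an assumption on my part, the answer does not say how the score is combined) is to pass termfreq as an additive boost function via the edismax bf parameter. The sketch only handles a single alphanumeric search term, and lowercases it on the assumption that the field's analyzer does the same.

```python
def click_boost_params(user_query):
    """edismax parameters that boost documents previously clicked for this term."""
    term = user_query.strip().lower()  # assumes the index lowercases terms
    if not term.isalnum():
        # Keep the sketch safe: no quotes, spaces or operators in the function call.
        raise ValueError("this sketch only supports a single alphanumeric term")
    return {
        "q": user_query,
        "defType": "edismax",
        "bf": f"termfreq(searchHistory,'{term}')",  # clicks recorded for this term
        "fl": "id,score",
    }


print(click_boost_params("Something"))
```

A multiplicative boost or an explicit sort would be alternatives; which one fits depends on how strongly clicks should outweigh text relevance.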
This is an interesting approach; however, I see some disadvantages in it:
The overall index size will grow dramatically with each and every search.
You are assuming that choosing a specific item is always a correct signal, and that the click was not made by mistake or just for a brief look. This way you might get wrong search results along the way.
I suggest only incrementing the counter, or even maintaining a relative counter based on the other results that the user did not click.
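One possible reading of a "relative counter" is a smoothed click-through rate: clicks divided by the number of times the item was shown, with a small prior so that a single accidental click does not dominate. This is my interpretation, and the function name and prior values are made up.

```python
def relative_popularity(clicks, impressions, prior_clicks=1, prior_impressions=10):
    """Smoothed click-through rate: clicks relative to how often the item was shown."""
    return (clicks + prior_clicks) / (impressions + prior_impressions)


# Shown 90 times and clicked 9 times: same rate as the prior, so no change.
print(relative_popularity(9, 90))  # 0.1
# Shown once and clicked once: nudged up, but far from a "100%" signal.
print(relative_popularity(1, 1))
```

The result could be written back to a numeric field with an atomic "set" update instead of "inc".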

SOLR 4.3 Modification of schema.xml not considered by the server

I have the following error: [doc=testIngestID411] unknown field 'dateImport'
At the beginning I did not have the field 'dateImport' in my solr schema. I decided to add it after launching solr a few times.
1. I added this field to schema.xml:
<filed name="dateImport" type="string" indexed="true" stored="true" required="true"/>
after the other pre-existing fields.
I removed all my existing documents using :
<delete><query>*:*</query></delete>
Stopped SOLR (using ctrl+c or by killing the jar process)
Restarted SOLR (using java -jar start.jar)
Then, when I try to insert a document with a field named dateImport, I get:
"unknown field 'dateImport'"
Extra information:
If I modify a field that already existed (i.e. one that was there the first time I launched this Solr core), the modification is picked up. For instance, if I change a field that was not required to required="true" (and restart Solr), then I can no longer add a document without specifying this field.
Also, I have noticed the following in the web admin interface:
On the left there is a tab called "Schema"; this schema contains all modifications (including the field dateImport). Above this tab there is another tab named "Schema Browser". The field 'dateImport' DOES NOT appear there :( .
What can I do to get this new field working??
Thank you
Change <filed ... to <field ...
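Since a misspelled element like this does not produce an error at startup, a quick sanity check of schema.xml can catch it. The sketch below flags element names it does not recognise at the top level and inside <fields>/<types>. The list of known names is my own and not exhaustive, so treat a hit as "worth a look", not as a definite error.

```python
import xml.etree.ElementTree as ET

# Assumed, incomplete list of element names expected in a schema.xml.
KNOWN_TAGS = {
    "fields", "types", "field", "dynamicField", "copyField", "fieldType",
    "uniqueKey", "similarity", "defaultSearchField", "solrQueryParser",
}


def suspicious_elements(schema_xml):
    """Return unrecognised element names directly under schema, fields or types."""
    root = ET.fromstring(schema_xml)
    containers = [root] + root.findall("fields") + root.findall("types")
    return [child.tag
            for parent in containers
            for child in parent
            if child.tag not in KNOWN_TAGS]


SAMPLE = """<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <filed name="dateImport" type="string" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>"""

print(suspicious_elements(SAMPLE))  # ['filed']
```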

SOLR - Use single text field in schema for full text search

I am getting familiar with SOLR.
I would like to use SOLR for full-text search over many kinds of entities. I don't want to create a Document for every different type of entity, and I don't need to be able to search specific fields. I am only interested in whether a specified string occurs anywhere in any item.
In database terms, for example, I have a table News and a table Employee and I want to search for the word 'apple'. I don't mind which field it is in; I only want to get back the database IDs of the records that contain it.
Could it be a solution, that I use a SOLR schema something like this:
<fields>
<field name="id" type="string" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="false"/>
</fields>
So, I only need an ID and the contents. I put all the data that I want to be able to search into one 'content' field. When I search for some words, it looks for them in the 'id' and in the 'content'.
Is this a good idea? Any performance or design problem?
Thanks,
Tamas
See https://wiki.apache.org/solr/SchemaXml#Copy_Fields. It says:
A common requirement is to copy or merge all input fields into a single solr field. This can be done as follows:-
<copyField source="*" dest="text"/>
That's typically what is done to search across multiple fields.
But if you don't even want your original fields, just concatenate all your fields into one big field content and index in Solr. There should be no problems with that.
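A minimal sketch of that concatenation on the client side, turning one database row into a two-field document. Prefixing the id with the table name is my own addition, so that News row 7 and Employee row 7 do not overwrite each other; the row contents are invented.

```python
def to_solr_doc(table, row, id_column="id"):
    """Flatten a database row into an {id, content} document."""
    values = [str(v) for k, v in row.items()
              if k != id_column and v is not None]
    return {
        "id": f"{table}:{row[id_column]}",  # assumed scheme: table name + primary key
        "content": " ".join(values),        # everything searchable in one field
    }


news_row = {"id": 7, "title": "Apple harvest", "body": "Record crop this year", "author": None}
print(to_solr_doc("News", news_row))
```

Splitting the returned id on ":" gives back the table and the database ID, which is all the question asks for.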
You can copyField everything to text (see the example in the distribution) and set that as the default field (the "df" parameter in solrconfig.xml for the select handler).
Or, if you anticipate more complex requirements down the line and/or non-text searches, I would recommend looking at eDismax with qf parameter and it will handle searching all those fields itself.
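For the eDismax route, the request would look roughly like the sketch below: defType selects the parser and qf lists the fields to search. The field names are placeholders, since the question does not define any beyond id and content.

```python
from urllib.parse import urlencode


def edismax_params(text, fields):
    """Search `text` across several fields at once and return only the id."""
    return {
        "q": text,
        "defType": "edismax",
        "qf": " ".join(fields),  # space-separated list of fields to query
        "fl": "id",
        "wt": "json",
    }


# "title", "body" and "name" are hypothetical field names.
params = edismax_params("apple", ["title", "body", "name"])
print("/select?" + urlencode(params))
```

With this approach the original fields stay separate in the index, so per-field boosts (for example title^2 in qf) remain possible later.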
