Apache solr multi field indexing with django haystack - solr

I have a design problem in django-haystack that I don't know how to solve. My Django model has a number of text fields that I want to index.
As far as I can tell from the official Haystack documentation, the only way to index content is to merge everything you want to index into a single template.
The problem is that I want to maintain a per-field index, i.e. I want to run a full-text search on each field separately or on a set of fields.
Is there a way to do it?

You should be able to declare additional CharFields on your search index, one per model field you want to search separately. The first tutorial page gives an example with the author field: http://django-haystack.readthedocs.org/en/latest/tutorial.html#handling-data
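Once each model field is indexed as its own field, a search can target one field or a set of fields. The following is only a sketch of the field-scoped query syntax that Solr understands, written in plain Python; the helper name field_query and the field names author and title are assumptions made for illustration, and Haystack normally builds such queries for you.

```python
def field_query(terms_by_field, operator="AND"):
    """Build a Solr query string that searches each field separately."""
    clauses = []
    for field, phrase in terms_by_field.items():
        # Escape backslashes and double quotes so the phrase stays one quoted term.
        escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"')
    return f" {operator} ".join(clauses)


# Search a single field, then a set of fields (hypothetical field names).
print(field_query({"author": "daniel"}))
print(field_query({"author": "daniel", "title": "solr tips"}, operator="OR"))
```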

Related

Solr multilingual search

I'm currently working on a project where we have indexed text content in Solr. Each piece of content is written in one specific language (we have 4 different European languages), but we would like to add a feature: if the primary search (the text entered by the user) doesn't return many results, we also look for documents in the other languages. So we would somehow need to translate the query.
Our starting point is that we can build a mapping list of translated words commonly used in the project's field.
One solution that came to mind was to use the synonym search feature, but this might not give the best results.
Does anyone have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could investigate?
Thanks
It seems like multi-lingual search is not a unique problem.
Please take a look
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
Those two links suggest having dedicated fields for each language, but you can also have a single field that states the document's language and add a filter query (&fq=) for the language you have detected from the user's query. I think this is the more scalable solution.
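As a rough sketch of the filter-query approach, the request parameters could be built like this. The field name language and the helper name are assumptions; q and fq are standard Solr request parameters.

```python
from urllib.parse import urlencode


def language_filtered_params(user_query, language):
    """Encode a search restricted to documents in one language."""
    # "language" is a hypothetical field holding each document's language code.
    return urlencode({"q": user_query, "fq": f"language:{language}"})


# Append the result to your core's /select URL.
print(language_filtered_params("hello world", "en"))
```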
One option would be to translate your terms at index time. This could probably be done at the Solr level, or before Solr at the application level. You would then store the translated texts in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
If you want to score primary-language matches higher, you could add a primary_language field and boost documents where it matches the search language.
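One possible way to express this is with the edismax query parser: qf lists the per-language fields to search and bq adds a boost for documents whose primary language matches. This is a sketch only; the boost factor of 2 and the helper name are arbitrary assumptions.

```python
def multilingual_params(user_query, search_language, languages=("en", "fi")):
    """Search all per-language fields and boost primary-language matches."""
    return {
        "defType": "edismax",
        "q": user_query,
        # Search every text_<lang> field, e.g. "text_en text_fi".
        "qf": " ".join(f"text_{lang}" for lang in languages),
        # Boost documents whose primary_language equals the search language.
        "bq": f"primary_language:{search_language}^2",
    }


print(multilingual_params("Hello", "en"))
```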

Adding new fields to solr after indexing with Dynamic fields

I am trying to learn and implement Solr for a customer use case, and we ran into a question: if we need to add more fields (for storing and indexing), is that possible without re-indexing or reloading the data? From what I found online, most sources say that adding a new field that does not need to be indexed is fine, but adding a new indexed field requires reloading/reindexing the data. However, schema.xml also has dynamic fields, which can be mapped to new fields whether they need to be indexed or just stored. My question is:
If that is a possible workaround for adding new fields to an existing index, why is it not suggested? Is there any overhead associated with it, or is it fine to use dynamic fields?
Dynamic fields are there so Solr knows how to map your new content to the types. You would still need to reindex the actual documents.
Solr has an API/format for partially updating a document, so you only need to provide the additional information, but under the covers that is still reindexing, and you need to make sure that all fields are stored. If a field has stored=false and you try a partial update, that value will disappear.
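A partial (atomic) update is sent as JSON in which the new field carries a modifier such as set. The sketch below only builds the request body; the field name color_s assumes a dynamic field rule like *_s in the schema, and id is assumed to be the unique key.

```python
import json


def atomic_set(doc_id, field, value):
    """Build the JSON body for a partial update that sets one field."""
    # "set" is the atomic-update modifier that replaces the field's value.
    return json.dumps([{"id": doc_id, field: {"set": value}}])


# POST this body to the core's /update handler with Content-Type: application/json.
print(atomic_set("doc1", "color_s", "red"))
```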

cakephp cakedc search without html tags

I'm using the cakedc plugin on cakephp to implement a search on a field in the database (called Post.body). It works fine, but if the field contains html tags (like <p> or <img>, etc), the search will be performed on them as well. Is it possible to filter out / sanitize the search?
Thank you in advance
It should be possible, but doing it in the query would be highly inefficient: it would most probably make using indices impossible, since the stored content would have to be filtered before the actual search could be performed.
I'd suggest to store a pre-filtered version of the content in an additional column, and search on that one instead. That way you can continue using simple search conditions and the DBMS can make use of full-text indices.
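The pre-filtering step itself is simple. Here is a minimal sketch in Python rather than PHP, for illustration only; in a CakePHP application the equivalent would run before the record is saved, and the extra column name (say, body_plain) is hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from adjacent elements does not run together,
    # then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())


print(strip_tags('<p>Hello <b>world</b></p><img src="a.png">'))
```

Note that joining with spaces also splits a word that is only partly wrapped in a tag, which is usually acceptable for a search column.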

Using Solr to store user specified information in documents

I have an application that contains a set of text documents that users can search. Every user must be able to search based on the text of the documents. In addition, users must be able to define custom tags and associate them with a document. Those tags are used in two ways:
1) Users must be able to search for documents based on specific tag ids.
2) There must be facets available for the tags.
My solution was to add a multivalued field to each document that acts as an array of the tag ids the document has been tagged with. So far so good. I was able to perform queries based on text and tag ids (for example text:hi AND tagIds:56).
My question is: would that solution work in production, in an environment where users add but also remove tags from documents? Remember, I have to have the data available in real time, so whenever a user adds or removes a tag I have to reindex that document and commit immediately. If that's not a good solution, what would be an alternative?
Stackoverflow uses Solr - this is in case if you doubt Solr abilities in production mode.
And although I couldn't find much information on how they have implemented tags, I don't think your approach sounds wrong. Yes, tagged documents will have to be reindexed (that means a slight delay) but other than that I don't see anything wrong with it.
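For reference, the query from the question combined with faceting on the tag field could be expressed with request parameters along these lines. This is a sketch; the helper name is an assumption, while facet and facet.field are standard Solr parameters.

```python
def tag_search_params(text, tag_id):
    """Search by text and tag id, and request facet counts per tag."""
    return {
        "q": f"text:{text} AND tagIds:{tag_id}",
        "facet": "true",
        # Facet on the multivalued field that stores the tag ids.
        "facet.field": "tagIds",
    }


print(tag_search_params("hi", 56))
```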

Solr. Store not the original field, but filtered one

I am trying to index Wikipedia's dump. In order to provide abstracts for the articles (or, maybe, enable the highlighting feature in the future), I'd like to store their text without wiki markup. As a first attempt, it would be enough for me to keep just the alphanumeric symbols. So the question is: is it possible to store a version of the field that has been filtered at the character level, rather than the original one?
There is no way to do this out of the box. If you want Solr to do this, you can create your own UpdateHandler, but this might be a little tricky. The easiest way to do this would be to pre-process the document before sending it to Solr.
By default, Solr stores the original field values, as they are before the index-time analyzers for your fieldType apply their filters. So by default it does not store the filtered value. However, you have two options for getting the result that you want.
You can apply the same filters to the field at query time as are being applied at index time to remove the wiki markup. Please see Analyzers, Tokenizers and Token Filters on the Solr Wiki for more details.
You can apply the filters to the data in a separate process prior to loading the data into Solr, then Solr will store the filtered values, since you will be passing them in already in a filtered state.
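A minimal sketch of such a pre-processing step, following the question's first goal of keeping only alphanumeric symbols. The function name is hypothetical, and the pattern is ASCII-only and crude: for a link like [[target|label]] it keeps both words.

```python
import re


def keep_alphanumeric(text):
    """Reduce wiki text to ASCII letters and digits separated by spaces."""
    # Replace every run of non-alphanumeric characters with a single space.
    return re.sub(r"[^0-9A-Za-z]+", " ", text).strip()


# Send the cleaned value to Solr in place of the raw markup.
print(keep_alphanumeric("'''Solr''' is a [[search engine]]."))
```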
