I have some questions about how indexing in Alfresco One works with transactional queries.
We use Alfresco 5.0.2, and in the documentation I read this: "When you are upgrading the database, you can add optional indexes in order to support the metadata query feature."
Suppose that in my model.xml I add a custom property like this:
<type name="doc:myDoc">
    <title>Document</title>
    <parent>cm:content</parent>
    <properties>
        <property name="doc:level">
            <title>Level</title>
            <type>d:text</type>
            <mandatory>true</mandatory>
            <index enabled="true">
                <atomic>true</atomic>
                <stored>false</stored>
                <tokenised>both</tokenised>
            </index>
        </property>
        ...
    </properties>
</type>
And in my alfresco-global.properties I have these settings:
solr.query.cmis.queryConsistency=TRANSACTIONAL_IF_POSSIBLE
solr.query.fts.queryConsistency=TRANSACTIONAL_IF_POSSIBLE
system.metadata-query-indexes.ignored=false
My first question is: how does Alfresco know which properties I want to index in the DB? Does it read my model.xml and index only the indexed properties that I specify there? Does it index all the custom properties? Or do I need to create a script to add these new indexes?
I read the metadata-query-indexes.sql script, but I don't understand how to rewrite it in order to add a new index for my property. If this script is necessary, could you give me an example with the doc:myDoc property that I wrote above, please?
Another question is about query syntax that isn't supported by the DB and goes directly to SOLR.
I read that using PATH, SITE, ANCESTOR, OR, or any d:content, d:boolean, or d:any property (among others) in your query makes it non-executable against the DB. But I don't understand what d:content is exactly.
For example, is a query (based on my custom property written above) like TYPE:whatever AND #doc\:level:"value" considered d:content? Is this query supported by the DB, or does it go to SOLR?
I also read this:
"Any property checks must be expressed in a form that means "identical value check" as querying the DB does not provide the same tokenization / similarity capabilities as the SOLR index. E.g. instead of my:property:"value" you'd have to use =my:property:"value" and "value" must be written in the proper case the value is stored in the DB."
Does this mean that if I use the =, for example =#doc\:level:"value", the query isn't accepted by the DB and goes to SOLR? Can't I search for an exact value in the DB?
I've been researching TMQs recently. I'm assuming that you need transactionality, which is why TMQ queries are interesting: queries via SOLR are eventually consistent, but TMQs will immediately reflect the change. There are certain applications where eventual consistency is a huge problem, so I'm assuming this is why you are looking into them.
Alfresco says that they use TMQs by default, and in my limited testing (200k documents) I found no appreciable performance difference between a SOLR and a TMQ query. I can't imagine they are horrible for performance if Alfresco made them the default, but I'd need to do further testing with millions of documents to be sure. It will of course depend on your database load. If your database is a bottleneck and you don't need the transactionality, you could consider using the # syntax in metadata searches to avoid them, or you could disable them via the properties configuration.
1) How does Alfresco know which properties I want to index in the DB? Does it read my model.xml and index only the indexed properties that I specify there? Does it index all the custom properties? Or do I need to create a script to add these new indexes?
When you execute a query using a syntax that is compatible with a TMQ, Alfresco will run it against the database. The default behavior is "TRANSACTIONAL_IF_POSSIBLE":
http://docs.alfresco.com/4.2/concepts/intrans-metadata-configure.html
You do not have to have the field marked as indexable in the model for this to work. This is unclear from the documentation, but I've tried disabling indexing for the field in the model and these queries still work. You don't even have to have SOLR running!
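For context, the optional DB indexes are generic rather than per-property: Alfresco stores every property value in the shared alf_node_properties table, keyed by the property's qname_id, so one index per typed value column covers all properties at once, including doc:level. A sketch of the kind of statement metadata-query-indexes.sql contains (index, table, and column names as I remember them from a 5.x schema, so verify against the actual script for your database):

-- illustrative only: generic indexes covering all string/long/boolean
-- property values, doc:level included
CREATE INDEX idx_alf_nprop_s ON alf_node_properties (qname_id, string_value, node_id);
CREATE INDEX idx_alf_nprop_l ON alf_node_properties (qname_id, long_value, node_id);
CREATE INDEX idx_alf_nprop_b ON alf_node_properties (qname_id, boolean_value, node_id);

In other words, there is no per-property statement to write for doc:level; running the shipped script for your database should be enough.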
2) Another question is about query syntax that isn't supported by the DB and goes directly to SOLR.
Your example with TYPE and an attribute does not go to SOLR. It's constructs like PATH that must go to SOLR.
3) "Any property checks must be expressed in a form that means "identical value check" as querying the DB does not provide the same tokenization / similarity capabilities as the SOLR index. E.g. instead of my:property:"value" you'd have to use =my:property:"value" and "value" must be written in the proper case the value is stored in the DB."
What they are saying is that you must use the = operator, not the default or the # operator. The # operator depends on tokenization, but TMQs go straight to the database. However, you can use * in an attribute value if you omit the quotes, like so:
=cm\:title:Startswith*
This works for me on 5.0.2 via TMQ. You can absolutely search for an exact value as well, however.
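To make it concrete, here are a few AFTS examples based on the doc:myDoc model from the question (the values are made up):

=doc\:level:"Public"           exact match; can run as a TMQ against the DB
doc\:level:"Public"            tokenized match; falls back to SOLR
PATH:"/app:company_home//*"    PATH is not DB-supported, so it always goes to SOLR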
I hope this cleared it up for you. I highly recommend setting solr.query.fts.queryConsistency=TRANSACTIONAL to always force TMQs in a test environment, and testing different queries if you still have questions about which syntax is supported.
Regards
A nice explanation can be found here.
https://community.alfresco.com/people/andy1/blog/2017/06/19/explaining-eventual-consistency
When changes are made to the repository they are picked up by SOLR via a polling mechanism. The required updates are made to the Index Engine to keep the two in sync. This takes some time. The Index Engine may well be in a state that reflects some previous version of the repository. It will eventually catch up and be consistent with the repository - assuming it is not forever changing.
I have a big list of related terms (not synonyms) that I would like my solr engine to take into account when searching. For example:
Database --> PostgreSQL, Oracle, Derby, MySQL, MSSQL, RabbitMQ, MongoDB
For this kind of list, I would like Solr to take into account that if a user searches for "postgresql configuration", results related to "RabbitMQ" or "Oracle" might also come back, but not as absolute synonyms: just to boost results that contain these keywords/terms.
What is the best approach to implement such connection? Thanks!
You've already discovered that these are synonyms - and that you want to use that metainformation as a boost (which is a good idea).
The key is then to define a field that does what you want - in addition to your regular field. Most of these cases are implemented by having a second field that does the "less accurate" version of the field, and apply a lower boost to matches in that field compared to the accurate version.
You define both fields - one with synonyms (for example content_synonyms) and one without (content) - and then add a copyField instruction from the content field (this means that Solr will take anything submitted to the content field and "copy" it as the source text for the content_synonyms field as well).
Using edismax you can then use qf to query both fields and give a higher weight to the exact content field: qf=content^10 content_synonyms will score hits in content 10x higher than hits in content_synonyms, in effect using the synonym field for boosting content.
The exact weights will have to be adjusted to fit your use case, document profile and query profile.
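For reference, a minimal schema sketch of this setup (the field names and the related-terms.txt file are illustrative, and SynonymGraphFilterFactory assumes Solr 6+):

<fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_related" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- expands "postgresql" to the whole related-terms group at query time -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="related-terms.txt" expand="true"/>
  </analyzer>
</fieldType>

<field name="content" type="text_plain" indexed="true" stored="true"/>
<field name="content_synonyms" type="text_related" indexed="true" stored="false"/>
<copyField source="content" dest="content_synonyms"/>

with related-terms.txt containing one group per line, e.g.:

database, postgresql, oracle, derby, mysql, mssql, rabbitmq, mongodb

A query would then look like q=postgresql configuration&defType=edismax&qf=content^10 content_synonyms, so exact-field hits dominate and related-term hits only add a small boost.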
When is it safe to update the Solr schema and keep the existing indexes?
I am upgrading Solr to version 7.2 now, and some type definitions in my old schema generate warnings in the log like:
Solr loaded a deprecated plugin/analysis class [solr.CurrencyField]. Please consult documentation how to replace it accordingly.
Is it safe to update this type definition to the new solr.CurrencyFieldType and keep my existing indexes:
When the type is not used in the schema for document properties.
When the type is used in the schema for document properties.
Generally, what schema change will definitely require a total reindex of the documents?
If the field isn't being used, you can do anything you like with it - the schema is Solr's way of enforcing validation and exposing certain low-level Lucene settings for field configuration. If you've never indexed any content using the field, you can update the field definition (or, maybe better, remove it if you're not using it) without reindexing.
However, if you change the definition of an existing field to a different type (for example, when the int type changed from being a Trie field to a Point field), the general rule is that you'll have to reindex to avoid random, weird, untraceable issues.
For TextFields, if you're not changing the field type - i.e. the field is still of the same type, but you're changing the analysis or tokenization for the field - you might not have to reindex. If the change is only to the query part of the analysis chain, no reindexing is needed. If the change is to the indexing part (or both), it depends on what the change is: the existing tokens stored in the index won't change, so if you have indexed content without lowercasing it and then add, for example, a lowercase filter for querying, you won't get a match for any existing tokens that contain uppercase. In that case you'll have to reindex to make your collection work properly again.
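For the concrete solr.CurrencyField warning above, the Solr 7 replacement would look roughly like this (attribute values follow the convention in the Solr reference guide; treat it as a sketch to check against your own schema):

<!-- old, deprecated in 7.x -->
<fieldType name="currency" class="solr.CurrencyField"
           precisionStep="8" defaultCurrency="USD" currencyConfig="currency.xml"/>

<!-- replacement: amount and currency code are kept in dynamic fields with these suffixes -->
<fieldType name="currency" class="solr.CurrencyFieldType"
           amountLongSuffix="_l_ns" codeStrSuffix="_s_ns"
           defaultCurrency="USD" currencyConfig="currency.xml"/>

Because the two classes store the value differently, this is exactly the "different type" case: if any indexed documents use a field of this type, plan for a full reindex; if the type is unused, you can swap or drop the definition freely.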
I am working with a Solr index that I have not made. I only have access to the Solr admin.
Each document returned by the query I write in the Solr admin has around 40 fields. These fields are not sorted alphabetically.
Now my question is: can I sort them somehow in the Solr admin?
If I cannot, I have the opportunity to import that index locally on my dev machine. I also have access to the config files (solrconfig, data import config, etc.).
Is it possible to do some magic in any of those config files and import locally so that the fields are sorted alphabetically?
No, neither Lucene nor Solr guarantees the order of the fields returned (the order of values inside a multi-valued field is, however, guaranteed).
You might have luck (you won't - see comment below - fl maintains the same order as in the document) by explicitly using the fl parameter to get the order you want, but that would require maintaining a long list of fields to be returned.
It's usually better to ask why you need the order of the fields to be maintained. The data returned from Solr is usually not meant for the user directly and should be processed in your controller / view layer to suit the use case.
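If you just need alphabetical fields for display, reordering in that layer is trivial. A minimal sketch in Python (the core name and URL are made up):

import requests

# query a hypothetical core and sort each document's fields by name
resp = requests.get(
    "http://localhost:8983/solr/mycore/select",
    params={"q": "*:*", "wt": "json"},
).json()

for doc in resp["response"]["docs"]:
    ordered = dict(sorted(doc.items()))  # alphabetical field order
    print(ordered)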
You could return it using the XSLT response writer instead of the XML one. Usually it is used to transform XML into a different form, but you could probably use it for an identity transformation with sorting added.
I don't think that's the best way forward, but if you are desperate, it is a way.
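If you want to try it anyway, a sketch of such a stylesheet (saved as conf/xslt/sortfields.xsl, a made-up name, and selected with wt=xslt&tr=sortfields.xsl):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity copy of the Solr XML response... -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- ...except each doc's fields are emitted sorted by their name attribute -->
  <xsl:template match="doc">
    <xsl:copy>
      <xsl:apply-templates select="*">
        <xsl:sort select="@name"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>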
I had the following question about GAE NDB - Index.
I assume you can specify an index via index.yaml or within the model definition using the property option indexed=True. Am I correct? If so, is one preferred over the other?
Is there a way to add/drop an index during the life cycle of the data objects?
Can I specify an index on a structured property field?
If so, can you please let me know the syntax for this?
Thanks in advance
By default, the properties that can be indexed (i.e. those that aren't variants of Blob) are indexed, which means you can filter or sort by them on their own. Adding single-property indexes to index.yaml would be unusual. Setting indexed=False for a property will mean fewer write-operations when saving entities, but will mean filtering or sorting by the property is no longer possible. I'd suggest reading the documentation on indexes.
If you want to filter or sort (in combination) by more than one property, then you need to include them in index.yaml. However, as you run code in the development server, if it requires an index that hasn't yet been specified, then index.yaml will be modified to contain a suitable index for the query being run. Adding indexes manually isn't necessarily something you'll ever have to do.
You can't index an entire StructuredProperty; the properties of structured properties are individually indexed, and you don't need to think about them any differently than regular properties. If you want to manually specify a multi-property index that includes a sub-property, you should be able to do so by using 'property.subproperty' (e.g. 'address.city').
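A short sketch pulling these points together (the model and property names are made up):

from google.appengine.ext import ndb

class Address(ndb.Model):
    city = ndb.StringProperty()      # sub-properties are indexed individually
    zip_code = ndb.StringProperty()

class Contact(ndb.Model):
    name = ndb.StringProperty()                    # indexed by default
    notes = ndb.TextProperty()                     # Blob variant, never indexed
    internal = ndb.BooleanProperty(indexed=False)  # cheaper writes, but no filter/sort
    address = ndb.StructuredProperty(Address)

# filtering on a sub-property works like any single-property query
contacts = Contact.query(Contact.address.city == "London").fetch()

Filtering by address.city in combination with a sort on name would then need a composite index in index.yaml, along the lines of:

indexes:
- kind: Contact
  properties:
  - name: address.city
  - name: name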
1) Yes, you can set certain properties as indexed. Some property types do not allow indexing at all. It's preferable to set the indexes programmatically within each model definition.
2) Although you can drop an index programmatically (i.e. remove indexed=True), I would not recommend it. It will leave your datastore in an inconsistent state.
3) It's not possible to set an index on a structured property; however, you can set a Key relationship between your model and the models in the structured property.
See:
https://developers.google.com/appengine/docs/python/ndb/entities
https://developers.google.com/appengine/docs/python/ndb/properties
"You can specify the usual property options for structured properties
(except indexed)."
As I discovered the hard way (see GAE python NDB projection query working in development but not in production), there's a big difference between having an index (and therefore needing an entry in index.yaml) and marking properties as indexed or not indexed. These things are there for different purposes:
Having an index allows you to search, sort, or do projection queries.
Marking properties as indexed or not tells an index which entities to include in the index and which to ignore.
Yes, absolutely, you can add or drop an index at any time:
Update your index.yaml
Then run one of the gcloud commands, such as '$ gcloud datastore indexes create index.yaml' or '$ gcloud datastore indexes cleanup index.yaml'.
No, you can't create an index on a structured property. See more info here https://cloud.google.com/appengine/docs/standard/python/ndb/entity-property-reference#structured
ElasticSearch has Mapping Types to, according to the docs:
Mapping types are a way to divide the documents in an index into logical groups. Think of it as tables in a database.
Is there an equivalent in Solr for this?
I have seen that some people include a new field in the documents and later on use this field to limit the search to a certain type of documents, but as I understand it, those documents have to share the schema, whereas (I believe) ElasticSearch mapping types don't. So, is there an equivalent?
Or, maybe a better question:
If I have multiple document types and I want to limit searches to a certain document type, which one offers the better solution?
I hope this question makes sense, since I'm new to both of them.
Thanks!
You can configure multicore solr:
http://wiki.apache.org/solr/CoreAdmin
Maybe something has changed since Solr 4.0 and it's easier now; I haven't looked at it since I switched to Elasticsearch. Personally, I find the Elasticsearch indexes/types system much better than that.
In Solr 4+.
If you are planning to do faceting or any other calculations across multiple types, then create a single schema with a differentiator field. Then, on your business/mapping/client layer, define only the fields you actually want to look at. Use custom search handlers with the 'fl' parameter to return only the fields relevant to that object. Of course, that means that all those single-type-only fields cannot be compulsory.
If your document types are completely disjoint, you can create a core/collection per type, each with its own definition file. You have full separation, but still have only one Solr server to maintain.
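A sketch of such a per-type search handler in solrconfig.xml (the handler name, field list, and the doc_type differentiator field are all illustrative):

<requestHandler name="/select-article" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- only return the fields that belong to this "type" -->
    <str name="fl">id,title,body</str>
  </lst>
  <lst name="appends">
    <!-- always restrict results to the differentiator value -->
    <str name="fq">doc_type:article</str>
  </lst>
</requestHandler>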
I have seen that some people include a new field in the documents and later on use this field to limit the search to a certain type of documents, but as I understand it, those documents have to share the schema, whereas (I believe) ElasticSearch mapping types don't.
You can do exactly this in Solr: add a field and use it to filter.
It is correct that mapping types in ElasticSearch do not have to share the same schema, but under the hood ElasticSearch uses only ONE schema for all mapping types, so technically it makes no difference. In fact, the mapping type is mapped to an internal schema field.
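To make the Solr side concrete, a minimal sketch with made-up names:

# index documents with an explicit differentiator field
curl 'http://localhost:8983/solr/mycore/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1","doc_type":"article","title":"Solr guide"},
       {"id":"2","doc_type":"product","title":"Solr mug"}]'

# restrict a search to one "type" with a filter query
curl 'http://localhost:8983/solr/mycore/select?q=title:solr&fq=doc_type:article'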