I'm seeing warnings like these when running the vespa-deploy prepare command:
The matching settings for the fields in fieldset 'default' are inconsistent (explicitly or because of field type). This may lead to recall and ranking issues.
The normalization settings for the fields in fieldset 'default' are inconsistent (explicitly or because of field type). This may lead to recall and ranking issues
The stemming settings for the fields in the fieldset 'default' are inconsistent (explicitly or because of field type). This may lead to recall and ranking issues.
After going through some documentation, I'm almost sure this is related to fieldsets. What is causing these warnings, and how can they be fixed?
You get these warnings if the fields referenced in the fieldset have different matching, stemming, or normalization settings (set explicitly, or implied by the field type). The query is processed with only one set of settings, while on the document side each field is indexed with its own settings, so inconsistent settings can lead to recall and ranking issues.
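As a minimal sketch (schema and field names are made up here), a fieldset that mixes a default-matched field with a word-matched, unstemmed field triggers exactly these warnings:
schema example {
    document example {
        field title type string {
            indexing: index | summary
            # default text matching, with stemming and normalization
        }
        field code type string {
            indexing: index | summary
            match: word
            stemming: none
            # word matching and no stemming: inconsistent with 'title'
        }
    }
    fieldset default {
        fields: title, code
    }
}
The fix is to give the fields in the fieldset the same match, stemming, and normalization settings, or to split them into separate fieldsets so each fieldset only contains fields that are processed the same way.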
Related
When is it safe to update the Solr schema and keep the existing indexes?
I am upgrading Solr to version 7.2 now, and some type definitions in my old schema generate warnings in the log like:
Solr loaded a deprecated plugin/analysis class [solr.CurrencyField]. Please consult documentation how to replace it accordingly.
Is it safe to update this type definition to the new solr.CurrencyFieldType and keep my existing indexes:
When the type is not used in the schema for document properties.
When the type is used in the schema for document properties.
Generally, what schema change will definitely require a total reindex of the documents?
If the field isn't being used, you can do anything you like with it - the schema is Solr's way of enforcing validation and exposing certain low-level Lucene settings for field configuration. If you've never indexed any content using the field, you can update the field definition (or better, remove it if you're not using it) without reindexing.
However, if you change the definition of an existing field to a different type (for example, when the int type changed from being a TrieInt to a Point field), the general rule is that you'll have to reindex to avoid weird, untraceable issues.
For TextFields, if you're not changing the field type - i.e. the field is still of the same type, but you're changing the analysis or tokenization for the field - you might not have to reindex. If the change only affects the query part of the analysis chain, no reindexing is needed. If it affects the index part (or both), it depends on the change: the tokens already stored in the index won't change, so if you indexed content without lowercasing and then add a lowercase filter on the query side, queries will no longer match existing tokens that contain uppercase. In that case you'll have to reindex to make your collection work properly again.
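As an illustrative sketch (the type name is made up), this is the kind of mismatch described above: separate index- and query-time analyzers where only the query side lowercases:
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- no lowercase filter: indexed tokens keep their original case -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- lowercased query tokens won't match mixed-case tokens already in the index -->
  </analyzer>
</fieldType>
Adding the same LowerCaseFilterFactory to both analyzers and then reindexing restores consistent matching.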
I have different data sources that upload different documents to a Solr sink. If two data sources send a field with the same name but different data types (say integer and double), indexing of the second document fails because the data type from the first document has already been added to the managed-schema.
All I need is for both fields to be indexed properly, as they were in Solr 4.x.
Since field names only arrive at runtime, please suggest a solution that would work for me. I suppose it needs a change in solrconfig.xml, but I could not find the required setting.
How was your Solr configured to work in 4.x? You can still do it exactly the same way in Solr 6.
On the other hand, the schemaless feature defines the type mapping the first time it sees a field. It has no way to know what will come in the future. That's also why all auto-created definitions are multivalued.
However, if the specific problem is that the guessed integer mapping is too narrow, you can change the definition of the UpdateRequestProcessor chain that actually does the mapping, and merge the integer/long/number mappings into one final tdoubles type.
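As a hedged sketch, assuming the stock "add-unknown-fields-to-the-schema" update processor chain from the Solr 6 data-driven example configs, the merged numeric mapping could look something like this in solrconfig.xml:
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">strings</str>
  <lst name="typeMapping">
    <!-- send every numeric value class to one wide type so int vs. double no longer conflicts -->
    <str name="valueClass">java.lang.Long</str>
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdoubles</str>
  </lst>
</processor>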
While using the command line tool to load from Datastore into BigQuery I've noticed the following strange behaviour. When I specify what fields to include using the option projection_fields, there is one rather complex nested field whose subfields are not all included. I can determine no pattern in the selection of subfields. Strangely, if I don't specify projection_fields (i.e. include all fields), all subfields are included. (At least I have to assume so, because one of these subfields is actually causing an error, see this previous question.)
I've not been able to find any explanation of projection_fields except that it can only be used on top-level fields. Is there some design behind this behaviour or is it a bug?
The answer to your question is in the official documentation for Jobs config (scroll down to "configuration.load.projectionFields"). It does indeed say the following (emphasis mine):
"If sourceFormat is set to "DATASTORE_BACKUP", indicates which entity properties to load into BigQuery from a Cloud Datastore backup. Property names are case sensitive and must be top-level properties. If no properties are specified, BigQuery loads all properties. If any named property isn't found in the Cloud Datastore backup, an invalid error is returned in the job result."
So, to answer your question, it is indeed by design.
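For reference, a hedged example of how the flag is passed to the command-line tool (dataset, table, bucket path, and property names are all made up; only top-level entity properties may be listed):
bq load --source_format=DATASTORE_BACKUP \
    --projection_fields="name,address" \
    mydataset.mytable \
    gs://mybucket/backup/default_namespace/kind_Customer/default_namespace_kind_Customer.backup_info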
I believe the subfields excluded are simply those that are null everywhere. The error referred to in the question has a different cause, and does not imply that these subfields would have been loaded if projection_fields had not been set.
Edit
Can Solr do fuzzy field collapsing, i.e. collapsing fields that have similar values, rather than identical ones?
I'd assumed that it could, but now I'm not sure, which makes my original question below invalid.
Original Question
For a large given set of values I need to decide which is the most prevalent. The set of all values will change over time, and so I can expect that the output may change over time too.
I gather Solr can do "field collapsing" to group results by a given field, with a tolerance of similarity. Would it be possible, nay even appropriate, to use Solr solely to collapse fields, to derive the most common value? We use Solr in other parts of the business, and it would be good to leverage existing code rather than home-brew a custom solution.
No, Solr does not support fuzzy collapsing (at least not based on what is documented on the wiki).
Solr 4.0 supports group.func which allows you to group results based on the result of a FunctionQuery, so it's possible that at some point in time a function could be created to get you approximately what you want, but none of the existing functions will do what you want.
However, Solr does support result clustering, which may work for your use case. Clustering is done with Carrot2. If you limit the fields used by Carrot2 to a single field, you may get a result similar to "fuzzy clustering", but you have far less control over what Carrot2 does than you do with field collapsing.
For a normal document you might want all your fields analyzed by carrot, e.g.:
carrot.title=my_title&carrot.snippet=my_title,my_description
But if you have, for example, a manufacturer field with slight variations of spelling or punctuation, it might work to only give carrot a single field for both title and snippet:
carrot.title=manufacturer&carrot.snippet=manufacturer
I'm a bit confused by some of the GAE documentation. While I intend to add indexes to optimize performance of my application, I wanted to clarify whether they are only suggested for this purpose or whether they are truly required.
Queries can't find property values that aren't indexed. This includes properties that are marked as not indexed, as well as properties with values of the long text value type (Text) or the long binary value type (Blob).
A query with a filter or sort order on a property will never match an entity whose value for the property is a Text or Blob, or which was written with that property marked as not indexed. Properties with such values behave as if the property is not set with regard to query filters and sort orders.
from http://code.google.com/appengine/docs/java/datastore/queries.html#Introduction_to_Indexes
The first paragraph leads me to believe that you simply cannot sort or filter on unindexed properties. However, the second paragraph makes me think that this limitation is only confined to Text or Blob properties or properties specifically annotated as unindexed.
I'm curious about the distinction because I have some numeric and string fields that I am currently sorting/filtering against in a production environment which are unindexed. These queries run in a background task that mostly doesn't care about performance (I would rather optimize for size/cost in this situation). Am I somehow just lucky that these are returning the right data?
In the GAE datastore, single-property indexes are automatically created for all properties that are not unindexed (either explicitly marked as such, or of the Text/Blob types).
The language in that doc, I suppose, is a tad confusing.
You only need to explicitly define indexes when you want to index by more than one property (say, for sorting by two different properties).
In GAE, unfortunately, if the property is marked as unindexed, e.g.
num = db.IntegerProperty(required=True, indexed=False)
then it is impossible to include it in a custom (composite) index. This is counterproductive (most built-in indexes are never used by my code, but they take lots of space), but it is how GAE currently works.
Datastore Indexes - Unindexed properties:
Note: If a property appears in an index composed of multiple properties, then setting it to unindexed will prevent it from being indexed in the composed index.
Never add a property to a model without explicitly setting either indexed=True or indexed=False. Indexes take substantial resources: space, write-op costs, and increased latency when doing put()s. We never add a property without explicitly stating its indexed value, even when it is indexed=False. It saves costly oversights, and forces one to always think a bit about whether or not to index. (You will at some point find yourself cursing the fact that you forgot to override the default of True.) GAE engineers would do a great service by not allowing this to default to True, imho. I would simply not provide a default if I were them. HTH. -stevep
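A small sketch of that convention, using the old db API (model and property names are made up):
from google.appengine.ext import db

class Reading(db.Model):
    sensor_id = db.StringProperty(required=True, indexed=True)   # filtered on, so keep the index
    raw_value = db.IntegerProperty(indexed=False)                # never queried; skip the index and its write costs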
You must define an index if you want to use two or more filters in a single query, e.g.:
Foobar.all().filter('foo =', foo).filter('bar =', bar)
If you query with just one filter, there is no need to define an index; the single-property index is auto-generated.
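A minimal index.yaml sketch for the two-filter query above (kind and property names are illustrative):
indexes:
- kind: Foobar
  properties:
  - name: foo
  - name: bar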
For Blob and Text properties you can't generate an index, even if you specify one in index.yaml, and you can't filter on them.
e.g.
class Foobar(db.Model):
    content = db.TextProperty()

Foobar.all().filter('content =', 'some text')  # fails: Text properties are not indexed
The code above will raise an error, because a TextProperty can't be given an index and can't be matched in a filter.