Properties vs Categories on an Aspect in Alfresco - solr

I'm using Alfresco 4.1.6 and Solr 1.4.
I'm reading about the possibility of using classifications for the nodes, specified with a type d:category in an aspect on the content model.
Search time is the most important factor in our project, which is why I am trying to find the best possible design. Our repository holds over 2 million documents spread across directories, and each of our roughly 3000 users has their own root path.
For the queries (FTS_ALFRESCO), we currently use TYPE (our model defines 5 distinct node types) and custom properties (every property we query on is indexed).
My question is this: imagine I change my model and turn one of our properties into a category, deleting the property and creating an aspect that holds the same value as a d:category. Will the search be faster and more efficient if I search by TYPE, property and category? Does Alfresco perform better when I search for this value as a category than when I search for it as a normal indexed property, or is it really the same? What are the benefits of using it as a category?

Categories and properties have different uses.
The main difference is:
Property: each content item can have a different value for the same property.
Category: the same category can be associated with multiple content items.
So you have to choose based on your requirements. As far as performance is concerned, I would guess that category-based search is faster, though I haven't actually tried it.

Related

Azure Search - Hierarchical facets guidance

I'm developing a project where I want to have hierarchical facets.
I have an index with a complex structure, like:
Index
-field1
-List<othercomplexfield>
And othercomplexfield contains another list with anothercomplexfield inside.
I'd like to be able to give to users the possibility to:
Have the facets of field1.
When one is selected, I'd like to give the user the possibility to select one of the values of a certain field of "othercomplexfield" while filtering by the selected field1.
I can do that.
I'd then like to give the user the possibility to select one of the possible values of "anothercomplexfield" while filtering by field1 AND by the selected othercomplexfield.
The difficulty here is that I don't want every possible facet value, but only the ones CONTAINED by the othercomplexfield that I'm filtering for.
So far I have had to do this inside C#, and I have not found a way to write a query that makes Azure Search return the distinct values I want.
Has anyone had a similar problem?
Did I explain the problem well enough?
I saw no clear guidance online, everything is easy if you only have level 1 facets but when you get into nested objects it's not that clear anymore.
I'm not sure I fully understand the context of your question. What I can tell you is that filters only apply at the document level and not at the complex collection level. What I mean by that is that if a filter matches an item in a complex collection, the entire document will be returned, not just the item in the complex collection that matched. The same is true for facets--facets will count all documents in the result set that match the filter and can't be scoped down just to parts of documents. With that, it seems like having this logic in your application like you mentioned might be the best approach for your current index schema.
We do have this old blog post that talks about one way to implement hierarchical facets with Azure Cognitive Search which may give you some other ideas on how you could implement the functionality you're looking for: https://learn.microsoft.com/en-us/archive/blogs/onsearch/multi-level-taxonomy-facets-in-azure-search
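The application-side logic mentioned above can be sketched roughly as follows. This is a minimal sketch over plain Python dictionaries, not a call to the Azure Cognitive Search API: the sample documents, the sub-field names `name` and `value`, and the helper `nested_facets` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical documents, shaped like the index described in the question.
# The sub-field names "name" and "value" are assumed for illustration.
docs = [
    {"field1": "A", "othercomplexfield": [
        {"name": "x", "anothercomplexfield": [{"value": "p"}, {"value": "q"}]},
        {"name": "y", "anothercomplexfield": [{"value": "r"}]},
    ]},
    {"field1": "A", "othercomplexfield": [
        {"name": "x", "anothercomplexfield": [{"value": "p"}]},
    ]},
    {"field1": "B", "othercomplexfield": [
        {"name": "x", "anothercomplexfield": [{"value": "z"}]},
    ]},
]

def nested_facets(documents, field1_value, other_name):
    """Count third-level values only inside the matching second-level items."""
    counts = Counter()
    for doc in documents:
        if doc["field1"] != field1_value:
            continue
        values = set()
        for item in doc["othercomplexfield"]:
            if item["name"] == other_name:
                values.update(v["value"] for v in item["anothercomplexfield"])
        counts.update(values)  # each document counts once per distinct value
    return dict(counts)

facets = nested_facets(docs, "A", "x")
```

Here the value "r" is left out because it sits under a different `othercomplexfield` item, which is exactly the scoping that document-level facets cannot provide.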

Solr documents with multiple parents

I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in their own fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside to this is though, that when e.g. a user updates their name, I have to run through all blog posts of them and update the documents for that in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations about how to handle this use case? Maybe my Google-fu is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The simplest and by far the most performant solution is to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than documents are updated. And even if a value that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and then working backwards will appear fast enough for most users. This assumes that you actually have to search on the values and don't just need to display them, in which case you could look them up from secondary storage when displaying an item (and store only the never-changing id in the document).
Another option is to treat this as a multi-search problem: one collection for blog posts, one for users and one for categories. You then search each collection for the relevant data and merge the results in your search model. You can also use Streaming Expressions to hand most of this processing off to a Solr cluster.
The reason I always recommend flattening where possible is that most features in Solr (and Lucene) are written for a flat document structure, so flattening lets you fully leverage them. Since Lucene is by design a flat document store, most other features need special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if that is possible at all). If the documents are flat, it just works.
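The flattening approach recommended above could look something like this on the application side. It is a minimal sketch in plain Python with no Solr client involved; the field names (`user_name`, `category_name`, `published`) and both helper functions are hypothetical, and sending the resulting documents to Solr is left out.

```python
# Hypothetical source records; the field names are assumptions.
users = {1: {"name": "Alice"}}
categories = {7: {"name": "Search"}}
blogs = [
    {"id": 10, "title": "Old post", "content": "...", "user_id": 1,
     "category_id": 7, "published": "2020-01-01"},
    {"id": 11, "title": "New post", "content": "...", "user_id": 1,
     "category_id": 7, "published": "2021-06-01"},
]

def flatten_blog(blog, users, categories):
    """Build one flat search document by copying the parent fields in."""
    return {
        "id": blog["id"],
        "title": blog["title"],
        "content": blog["content"],
        "user_id": blog["user_id"],
        "user_name": users[blog["user_id"]]["name"],
        "category_id": blog["category_id"],
        "category_name": categories[blog["category_id"]]["name"],
    }

def docs_to_reindex(blogs, user_id):
    """Posts to rebuild after a user rename, most recent first."""
    affected = [b for b in blogs if b["user_id"] == user_id]
    return sorted(affected, key=lambda b: b["published"], reverse=True)

# After a rename, rebuild the newest posts first and work backwards.
users[1]["name"] = "Alice B."
reindexed = [flatten_blog(b, users, categories) for b in docs_to_reindex(blogs, 1)]
```

Keeping `user_id` and `category_id` in the flat document is what makes it cheap to find the posts affected by a change to a parent.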

Data storage: "grouping" entities by property value? (like a dictionary/map?)

Using AppEngine datastore, but this might be agnostic, no idea.
Assume a database entity called Comment. Each Comment belongs to a User. Every Comment has a date property, pretty standard so far.
I want something that will let me: specify a User and get back a dictionary-ish (coming from a Python background, pardon. Hash table, map, however it should be called in this context) data structure where:
keys: every date appearing in the User's comments
values: Comments that were made on that date.
I guess I could just iterate over a range of dates and build a map like this myself, but I seriously doubt I need to "invent" my own solution here.
Is there a way/tool/technique to do this?
Datastore supports both references and list properties. This lets you build one-to-many relationships in two ways:
Parent (User) has a list property containing keys of Child entities (Comment).
Child has a key property pointing to Parent.
Since you need to limit Comments by date, you'd best go with option two. Then you could query Comments which have date=somedate (or date range) and where user=someuserkey.
There is no native grouping functionality in Datastore, so to also "group" by date, you can add a sort on date to the query. Then, as you iterate over the results, each time the date changes you can use or store it as a grouping key.
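The grouping step described above can be sketched with `itertools.groupby`. This is only an illustration: the list of dictionaries stands in for the results of a Datastore query that is already filtered by user and sorted by date, and the helper name `group_by_date` is made up.

```python
from collections import OrderedDict
from datetime import date
from itertools import groupby

# Stand-in for query results, already sorted by date (assumed sample data).
comments = [
    {"user": "alice", "date": date(2012, 1, 1), "text": "first"},
    {"user": "alice", "date": date(2012, 1, 1), "text": "second"},
    {"user": "alice", "date": date(2012, 1, 3), "text": "third"},
]

def group_by_date(sorted_comments):
    """Map each date to the comments made on it; input must be sorted by date."""
    grouped = OrderedDict()
    for day, items in groupby(sorted_comments, key=lambda c: c["date"]):
        grouped[day] = list(items)
    return grouped

by_date = group_by_date(comments)
```

`groupby` only merges adjacent items, which is why the sort on date in the query matters.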
Update
Designing no-sql databases should be access-oriented (versus datamodel oriented in sql): for often-used operations you should be getting data out as cheaply (= as few operations) as possible.
So, as a rule of thumb you should, in one operation, only get data that is needed at that moment (= shown on that page to user). I'm not sure about your app's design, but I doubt you need all user's full comments (with text and everything) at one time.
I'd start by saying you shouldn't apologize for having a Python background. App Engine originally supported only Python. Using the db module, you could have a User entity as the parent of several DailyCommentBatch entities, each the parent of a couple of Comment entities. IIRC, this will keep all related entities stored together (or close).
If you are using NDB (I love it), you may have to employ a StructuredProperty at either the User or the DailyCommentBatch level.

How can I discover if a property of a stored Entity is indexed or unindexed?

I have several entities in datastore, but I don't know if some of their properties are indexed or unindexed.
How can I discover (with the admin console or programmatically) if a property of a stored Entity is indexed or unindexed?
By default every property is indexed (unless it is a TextProperty or BlobProperty). If you don't want a property indexed, you need to (and should) set its indexed option to False, which improves performance and reduces entity write costs.
The admin console gives no indication of whether a property is indexed. You can try executing "select * from EntityType order by Property" in the GQL query box of the Datastore Viewer and see if it fails.
If you've been flipping between indexed=True and indexed=False on some properties over time, and have a set of entities written under both regimes, then you'll have some properties that are indexed and some that aren't. Is this the situation you're in?
If you don't have reliable history on your code, trying to determine if you're in this situation is a bit tricky, depending on how many entities you have. You can determine if you're in an inconsistent state by noting if a keys-only query on an Entity returns a different number of keys than a query that filters on the suspect property. A filter won't find unindexed properties. If you've got a lot of entities, you'll have to shard the counting somehow (to avoid timing out on a long query that returns lots of entities).
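The comparison described above can be sketched on two plain lists of keys. This is a sketch only: the two lists stand in for the results of a keys-only query on the kind and a keys-only query that filters or sorts on the suspect property, and fetching them (in shards, if needed) is not shown. The helper name and the sample keys are made up.

```python
def missing_from_index(all_keys, indexed_keys):
    """Keys found by the plain query but not by the query on the property."""
    return sorted(set(all_keys) - set(indexed_keys))

# Assumed sample results of the two keys-only queries.
all_keys = ["k1", "k2", "k3", "k4"]
indexed_keys = ["k1", "k3"]

# Different counts mean the property is not indexed on every entity.
inconsistent = len(all_keys) != len(indexed_keys)
suspects = missing_from_index(all_keys, indexed_keys)
```

Note that an entity that has no value for the property at all would also show up among the suspects, so the result is a list of candidates to inspect rather than a definitive answer.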
If you determine that you do have inconsistent indexing and want to repair your entities to be consistent, the usual approach is to write a mapreduce that touches all of your unstable entities and issues puts on the necessary properties.
Take a look at the "Datastore Indexes" interface, which is linked from the left navigation menu of the App Engine dashboard.
There you'll see the list of indexes and the specific properties each index covers.
For composite indexes (i.e. the ones defined in datastore-indexes.xml or index.yaml), you can use the low-level API to get the list of indexes present in your app's datastore.
In GAE/J, you would invoke DatastoreServiceFactory.getDatastoreService().getIndexes(), while in Python the same function is provided by db.get_indexes().

Tagging schema for AppEngine

Hey,
I'm using AppEngine for an application that I'm writing, and I need to assign tags to each object. I wanted to know the best way of doing this.
Should I create a space-separated string of tags and then query with something like %search_tag% (I'm not sure if you can do that in JDOQL)?
What other options do I have ?
Should I create another class which will map every object to a tag?
Which would be the best from the point of view of scalability, performance and ease of use?
Thanks
First, '%search_tag%' type 'LIKE' queries do not work on App Engine's datastore. The best you can do is a prefix search.
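A prefix search is usually expressed as a range query: values greater than or equal to the prefix and less than the prefix followed by a very high character. The sketch below only simulates that range on an in-memory list; the helper names are made up, and the use of `\ufffd` as the upper bound is the conventional trick rather than something stated in this answer.

```python
def prefix_bounds(prefix):
    """Lower and upper bounds of the range that covers every value with this prefix."""
    return prefix, prefix + u"\ufffd"

def prefix_search(values, prefix):
    """In-memory stand-in for a query with value >= low AND value < high."""
    low, high = prefix_bounds(prefix)
    return [v for v in sorted(values) if low <= v < high]

tags = ["python", "pylons", "java", "pyramid", "perl"]
matches = prefix_search(tags, "py")
```

The same two bounds would be passed as the inequality filters of a real query; a match in the middle of a string, as `%search_tag%` implies, cannot be expressed this way.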
It is difficult to answer very general questions like this. The best solution depends on several factors. How many tags do you expect per entity? Is there a limit to the number of tags? How will you use the tags: for searching, or for display only? The answers to all these questions affect how you should design your models.
One general solution for tagging is to use a multi-valued property, such as a list of tags.
http://code.google.com/appengine/docs/java/datastore/dataclasses.html#Collections
Be aware that if your entities have many tags, this adds overhead at write time, since the index writes take time too. Also, try to avoid using a multi-valued property more than once (or several multi-valued properties) in queries with inequalities or sort orders. That can lead to 'exploding indexes,' since one index row gets written for every combination of the indexed fields.
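To see why the combinations add up quickly, the row count can be modelled as a Cartesian product of the values of each indexed field. This is a simplified model for illustration, not the datastore's actual index format; the entity, its field names and the helper `index_rows` are hypothetical.

```python
from itertools import product

def index_rows(entity, indexed_fields):
    """Enumerate one row per combination of values of the indexed fields."""
    value_lists = []
    for field in indexed_fields:
        value = entity[field]
        # A single-valued property contributes exactly one value.
        value_lists.append(value if isinstance(value, list) else [value])
    return list(product(*value_lists))

# Hypothetical entity with two multi-valued properties.
entity = {"title": "post", "tags": ["a", "b", "c"], "labels": ["x", "y"]}

rows = index_rows(entity, ["tags", "labels", "title"])
single = index_rows(entity, ["tags", "title"])
```

With 3 tags and 2 labels the composite index already needs 6 rows for one entity, while an index over the tags alone needs 3; with dozens of values per list the product grows very fast.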
