Optimization of Solr indexing by removing redundancy

I'm working on a production scenario that currently has little data but will soon grow into the millions of records. Scenario: I have a folder that contains data for multiple students (student_id, rol, etc.).
Now, one student's data can be in different folders (yes, it's our requirement). In the current system, all the details of a student are indexed under each folder. Because there is so little data, duplication doesn't create a problem right now. But if we continue with the same process, the same student's data will be indexed multiple times (depending on the number of folders containing that student's data), thereby increasing redundancy and index size.
I want to minimize the index size and avoid data redundancy. Please suggest a simple solution for achieving this in Solr.

As long as you have a uniqueKey field defined, any document with the same key as a previous document will overwrite the existing document, and you'll avoid having duplicates in your index.
If you don't have a unique value that identifies your students, you're going to have a hard time merging anything (outside of Solr as well), and you might have to write some custom code to merge entries appropriately outside of Solr.
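If you do end up merging outside of Solr, a minimal sketch of that step could look like the following. It assumes each input record carries a `student_id` and the name of the folder it was found in; the `folder` and `folders` names and the `merge_by_unique_key` helper are made up for this illustration and are not part of Solr.

```python
from collections import OrderedDict


def merge_by_unique_key(records, key="student_id", folder_field="folder"):
    """Collapse records that share the same key into one document each.

    The per-record folder value is collected into a list under "folders"
    (a hypothetical field name) so that the student is indexed only once.
    """
    merged = OrderedDict()
    for record in records:
        doc_id = record[key]
        if doc_id not in merged:
            # First time we see this student: copy everything except the folder.
            merged[doc_id] = {k: v for k, v in record.items() if k != folder_field}
            merged[doc_id]["folders"] = []
        folder = record.get(folder_field)
        if folder is not None and folder not in merged[doc_id]["folders"]:
            merged[doc_id]["folders"].append(folder)
    return list(merged.values())


records = [
    {"student_id": "s1", "name": "Ann", "folder": "A"},
    {"student_id": "s1", "name": "Ann", "folder": "B"},
    {"student_id": "s2", "name": "Bob", "folder": "A"},
]
docs = merge_by_unique_key(records)
```

Each resulting dictionary would then be sent to Solr once, with `student_id` as the uniqueKey and the folder list in a multiValued field, instead of once per folder.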

Related

Solr Query Performance on Large Number Of Dynamic Fields

This question is a follow-up question to my previous question: Is child documents in solr anti-pattern?
I am creating a new question on dynamic field performance as I did not find any recent relevant posts on this topic and felt it deserved a separate question here.
I am aware that dynamic fields are treated as static fields and performance-wise both are similar.
Further, from what I have read, dynamic fields are not memory-efficient. Say one document has 100 fields and another has 1000 (the maximum number of fields in the collection): Apache Solr will allocate a memory block that supports all 1000 fields for every document in the collection.
I have a requirement where I have 6-7 fields that could be part of child documents and each parent document could have up to 300 child documents. Which means each parent document could have ~2000 fields.
What will be the performance impact on queries when we have such a large number of fields in the document?
That really depends on what you want to do with the fields and how they are defined. With docValues, most earlier issues with memory usage for sparse fields (i.e. fields that only have values in a small number of the total number of documents) are solved.
Also, you can usually rewrite those dynamic fields to a single multiValued field for filtering instead of filtering on each field (i.e. common_field:field_prefix_value where common_field contains the values you want to filter on prefixed with a field name / unique field id).
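As a rough sketch of that rewrite, the helper below turns every dynamic string field of a document into a prefixed token for one multiValued field. The `_s` suffix, the underscore separator and the function name `to_filter_tokens` are assumptions for the example only.

```python
def to_filter_tokens(doc, suffix="_s", separator="_"):
    """Build the values for a single multiValued filter field.

    Every field whose name ends with `suffix` contributes one token per
    value, formed as <field name><separator><value>.
    """
    tokens = []
    for name, value in sorted(doc.items()):
        if not name.endswith(suffix):
            continue
        # Accept both single values and lists of values.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            tokens.append(f"{name}{separator}{v}")
    return tokens


doc = {"id": "1", "color_s": "red", "size_s": ["m", "l"]}
tokens = to_filter_tokens(doc)
```

The tokens would be indexed into the common field, and a filter that used to be `color_s:red` becomes `common_field:color_s_red`.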
Finally, it will depend on how many documents you have in total. If you only have 1000 documents, it won't be an issue in any way. If you have a million, it used to be, depending on what you needed those dynamic fields for. These days it really isn't an issue, and I'd start out with the naive, direct solution and see if that works properly for your use case. It's rather hard to say without knowing exactly what these fields will contain, what the use case for the fields is, and what the query profile of your application looks like.
Also consider using a "side car" index if necessary, i.e. a special index with duplicated data from your main index to solve certain queries or query requirements. You pick which index to search based on the use case, and then return the appropriate data to the user.

Is there a better way to represent provenance on a field level in SOLR

I have documents in SOLR which consist of fields where the values come from different source systems. The reason why I am doing this is because this document is what I want returned from the SOLR search, including functionality like hit highlighting. As far as I know, if I use join with multiple SOLR documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to start each dynamic field name with an 8-character hash representing the source system. That way, the systems can share common field names after the unique source hash, and I can easily clear out all fields that start with a given source prefix if needed.
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to drag in a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to perform when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be specific to the indexing process. Usually the subsystems will be performant enough (at least if doing batch retrieval depending on the documents that are being updated).
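A minimal sketch of the "always rebuild the complete document" idea, assuming each source system can be wrapped in a function that returns the current fields for an entity. The fetcher functions and the in-memory dictionaries standing in for systems A and B are hypothetical.

```python
def build_document(entity_id, fetchers):
    """Assemble the full document from every source system at index time."""
    doc = {"id": entity_id}
    for source_name, fetch in fetchers.items():
        # Each fetcher returns the *current* fields for this entity,
        # so fields removed at the source simply do not show up.
        fields = fetch(entity_id) or {}
        doc.update(fields)
    return doc


# Stand-ins for the two source systems (assumed for illustration).
system_a = {"42": {"name": "Widget", "field_1_s": "from A"}}
system_b = {"42": {"field_2_s": "from B"}}

fetchers = {
    "system_a": lambda entity_id: system_a.get(entity_id),
    "system_b": lambda entity_id: system_b.get(entity_id),
}

before = build_document("42", fetchers)

# System B drops its data; rebuilding yields a document without field_2_s.
system_b["42"] = {}
after = build_document("42", fetchers)
```

The rebuilt document replaces the old one in the index as a whole, so there is no need to track which field came from which system.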

Choice of Database schema for storing folder system

I'm trying to implement an SQLite-based database that can store the full structure of a 100GB folder with a complex substructure (expecting 50-100K files). The main aim of the DB would be to get rapid queries on various aspects of this folder (total size, size of any folder, history of a folder and all its contents, etc).
However, I realized that finding all the files inside a folder, including all of its sub-folders, is not possible without recursive queries if I just make a "file" table with just a parent_directory field. I consider this one of the most important features I want in my code, so I have considered two schema options for this as shown in the figure below.
In schema 1, I store all the file names in one table and directory names in another table. They both have a "parentdir" item, but also have a text (apparently text/blob are the same in sqlite) field called "FullPath" that will save the entire path from the root to the particular file/directory (like /etc/abc/def/wow/longpath/test.txt). I'm not assuming a maximum subfolder limit so this could theoretically be a field that allows up to 30K characters. My idea is that if I want all the files or directories belonging to any parent, I just query the fullpath of the parent on this field and get the fileIDs.
In schema 2, I store only filenames and fileIDs in the files table, and DirNames and DirIDs in the directories table. But in a third table called "Ancestors", I store for each file a set of entries for each directory that is its ancestor (so in the above example, test.txt will have 5 entries, pointing to the DirIDs of the folders etc, abc, def, wow and longpath respectively). Then if I want the full contents of any folder I just look for the DirID in this table and get all the fileIDs.
I can see that in schema 1 the main limit might be full-text search of a variable-length text column, and in schema 2 the main limit is that I might have to add a ton of entries for files that are buried deep within 100 directories or something.
What would be the best of these solutions? Is there any better solution that I did not think of?
Your first schema will work just fine.
When you put an index on the FullPath column, use either the case-sensitive BETWEEN operator for queries, or use LIKE with either COLLATE NOCASE on the index or with PRAGMA case_sensitive_like.
Please note that this schema also stores all parents, but the IDs (the names) are all concatenated into one value.
Renaming a directory would require updating all its subtree entries, but you mention history, so it's possible that old entries should stay the same.
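Here is a minimal sketch of the indexed range lookup on the full path, using Python's built-in sqlite3 module. The table and column names (`files`, `full_path`, `size`) are assumptions, and it uses a half-open `>=` / `<` range rather than a literal BETWEEN so that the upper bound itself is excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    " file_id INTEGER PRIMARY KEY,"
    " full_path TEXT NOT NULL,"
    " size INTEGER NOT NULL)"
)
# The default BINARY collation keeps comparisons case-sensitive,
# so this index can serve the range query below.
conn.execute("CREATE INDEX idx_files_full_path ON files (full_path)")
conn.executemany(
    "INSERT INTO files (full_path, size) VALUES (?, ?)",
    [
        ("/etc/abc/a.txt", 10),
        ("/etc/abc/def/b.txt", 20),
        ("/etc/abcd/c.txt", 40),
        ("/etc/z.txt", 80),
    ],
)


def _bounds(directory):
    """Return the [lower, upper) range covering every path below directory."""
    prefix = directory.rstrip("/") + "/"
    # '0' is the character right after '/' in ASCII, so prefix[:-1] + '0'
    # is the smallest string greater than every string starting with prefix.
    return prefix, prefix[:-1] + "0"


def subtree_files(conn, directory):
    """List (path, size) for all files anywhere below directory."""
    return conn.execute(
        "SELECT full_path, size FROM files"
        " WHERE full_path >= ? AND full_path < ? ORDER BY full_path",
        _bounds(directory),
    ).fetchall()


def subtree_size(conn, directory):
    """Total size of all files anywhere below directory."""
    row = conn.execute(
        "SELECT COALESCE(SUM(size), 0) FROM files"
        " WHERE full_path >= ? AND full_path < ?",
        _bounds(directory),
    ).fetchone()
    return row[0]
```

Appending the trailing slash before building the range is what keeps `/etc/abcd/c.txt` out of the results for `/etc/abc`.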
Your second schema is essentially the Closure Table mentioned in Dan D's comment.
Take care to not forget the entries for depth 0.
This will store lots of data, but being IDs, the values should not be too large.
(You don't actually need RelationshipID, do you?)
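A small sketch of the closure table, again with Python's sqlite3 module. The table layout (`dirs`, `files`, `ancestors`) and the helper names are assumptions for illustration; here the closure rows relate directories to directories and each file just points at its own directory, which is one possible variation on the schema in the question. The composite primary key takes the place of a separate RelationshipID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dirs (dir_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE files (
        file_id INTEGER PRIMARY KEY,
        dir_id  INTEGER NOT NULL REFERENCES dirs(dir_id),
        name    TEXT NOT NULL
    );
    CREATE TABLE ancestors (
        ancestor_id   INTEGER NOT NULL REFERENCES dirs(dir_id),
        descendant_id INTEGER NOT NULL REFERENCES dirs(dir_id),
        depth         INTEGER NOT NULL,
        PRIMARY KEY (ancestor_id, descendant_id)
    );
    """
)


def add_dir(conn, name, parent_id=None):
    """Insert a directory together with its closure rows."""
    dir_id = conn.execute("INSERT INTO dirs (name) VALUES (?)", (name,)).lastrowid
    # Depth-0 row: every directory is its own ancestor.
    conn.execute("INSERT INTO ancestors VALUES (?, ?, 0)", (dir_id, dir_id))
    if parent_id is not None:
        # Inherit every ancestor of the parent, one level further away.
        conn.execute(
            "INSERT INTO ancestors (ancestor_id, descendant_id, depth)"
            " SELECT ancestor_id, ?, depth + 1 FROM ancestors"
            " WHERE descendant_id = ?",
            (dir_id, parent_id),
        )
    return dir_id


def files_under(conn, dir_id):
    """Names of all files in dir_id or any of its sub-directories."""
    rows = conn.execute(
        "SELECT f.name FROM files f"
        " JOIN ancestors a ON a.descendant_id = f.dir_id"
        " WHERE a.ancestor_id = ? ORDER BY f.name",
        (dir_id,),
    )
    return [name for (name,) in rows]


etc = add_dir(conn, "etc")
abc = add_dir(conn, "abc", etc)
deep = add_dir(conn, "def", abc)
conn.executemany(
    "INSERT INTO files (dir_id, name) VALUES (?, ?)",
    [(etc, "top.txt"), (abc, "mid.txt"), (deep, "test.txt")],
)
```

Without the depth-0 rows, `files_under(conn, abc)` would miss `mid.txt`, because files stored directly in a directory are only reachable through that directory's row for itself.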
Another choice for storing trees is the nested set model, or the similar nested interval model.
The nested set model allows retrieving subtrees by querying for an interval, but updates are hairy.
The nested interval model uses fractions, which are not a native data type and therefore cannot be indexed.
I'd estimate that the first alternative would be easiest to use.
It should also be no slower than the others if lookups are properly indexed.
My personal favourite is the visitation number approach, which I think would be especially useful for you since it makes it pretty easy to run aggregate queries against a record and its descendants.

Should I denormalize properties to reduce the number of indexes required by App Engine?

One of my queries can take a lot of different filters and sort orders depending on user input. This generates a huge index.yaml file of 50+ indexes.
I'm thinking of denormalizing many of my boolean and multi-choice (string) properties into a single string list property. This way, I will reduce the number of query combinations because most queries will simply add a filter to the string list property, and my index count should decrease dramatically.
It will surely increase my storage size, but this isn't really an issue as I won't have that much data.
Does this sound like a good idea or are there any other drawbacks with this approach?
As always, this depends on how you want to query your entities. For most of the sorts of queries you could execute against a list of properties like this, App Engine will already include an automatically built index, which you don't have to specify in index.yaml. Likewise, most queries that need a composite index either couldn't be done with a list property, or would require an 'exploding' index on that list property.
If you tell us more about the sort of queries you typically run on this object, we can give you more specific advice.
Denormalizing your data to cut back on the number of indices sounds like a good tradeoff. Reducing the number of indices you need means fewer indices to update (though your one index will have more updates); it is unclear how this will affect performance on GAE. Size will of course be larger if you leave the original fields in place (since you're copying data into the string list property), but this might not be too significant unless your entity was quite large already.
This is complicated a little bit since the index on the list will contain one entry for each element in the list on each entity (rather than just one entry per entity). This will certainly impact space, and query performance. Also, be wary of creating an index which contains multiple list properties or you could run into a problem with exploding indices (multiple list properties => one index entry for each combination of values from each list).
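To get a feel for how quickly this grows, here is a deliberately simplified model of the entries a single entity contributes to an index over one or more list properties. It ignores scalar properties and sort directions, and the property values are made up.

```python
from itertools import product


def index_entries(*list_properties):
    """One index entry per combination of values, one value from each list."""
    return list(product(*list_properties))


tags = ["red", "large", "sale"]
flags = ["featured", "new", "in_stock", "gift"]

single = index_entries(tags)           # index over one list property
combined = index_entries(tags, flags)  # index over two list properties
```

With one list of three values the entity contributes three entries; adding a second list of four values to the same index multiplies that to twelve, which is the combinatorial growth behind exploding indices.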
Try experimenting and see how it works in practice for you (use AppStats!).
"It will surely increase my storage size, but this isn't really an issue as I won't have that much data."
If this is true then you have no reason to denormalize.

How to organize a large number of objects

We have a large number of documents and metadata (xml files) associated with these documents. What is the best way to organize them?
Currently we have created a directory hierarchy:
/repository/category/date(when they were loaded into our db)/document_number.pdf and .xml
We use the path as a unique identifier for the document in our system.
Having a flat structure doesn't seem to be a good option. Also, using the path as an id helps to keep our data independent from our database/application logic, so we can reload the documents easily in case of failure, and all documents will maintain their old ids.
Yet it introduces some limitations. For example, we can't move the files once they've been placed in this structure, and it takes work to put them there in the first place.
What is the best practice? How do websites such as Scribd deal with this problem?
Your approach does not seem unreasonable, but might suffer if you get more than a few thousand documents added within a single day (file systems tend not to cope well with very large numbers of files in a directory).
Storing the .xml document beside the .pdf seems a bit odd. If it's really metadata about the document, should it not be in the database (which it sounds like you already have), where it can be easily queried and indexed, etc.?
When storing very large numbers of files I've usually taken the file's key (say, a URL), hashed it, and then stored it X levels deep in directories based on the first characters of the hash...
Say you started with the key 'How to organize a large number of objects'. The md5 hash for that is 0a74d5fb3da8648126ec106623761ac5 so you might store it at...
base_dir/0/a/7/4/http___stackoverflow.com_questions_2734454_how-to-organize-a-large-number-of-objects
...or something like that which you can easily find again given the key you started with.
This kind of approach has one advantage over your date one in that it can be scaled to suit very large numbers of documents (even per day) without any one directory becoming too large, but on the other hand, it's less intuitive to someone having to manually find a particular file.
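A minimal sketch of that hashed layout in Python, using only the standard library. The function name `sharded_path`, the default of four directory levels and the example key are assumptions; md5 is used here purely to spread files evenly, not for security.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(base_dir, key, filename, levels=4):
    """Place filename under `levels` nested directories, one hex digit each.

    The directories are taken from the first characters of the md5 hash of
    the key, so the same key always maps to the same location.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return PurePosixPath(base_dir, *digest[:levels], filename)


path = sharded_path("base_dir", "some document key", "document_number.pdf")
```

With one hex digit per level, each directory has at most 16 sub-directories, and four levels give 65,536 leaf directories to spread the files over. Raising `levels` adds more buckets if the leaf directories start filling up.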
