How do I index text files, web sites, and databases in the same Solr schema? All three sources are a requirement and I'm trying to figure out how to do it. I built some examples and they work fine while they're separate from each other; now I need them all in one schema, since the user will be searching across all three data sources.
How should I proceed?
You should sketch out a few notes for each of your content sources:
What meta-data is available?
How is the information accessed?
How do I want to present the information?
Once that is done, determine which meta-data you want to make searchable. Some of it might be very specific to just one of the content sources (such as author on web pages, or any given field in a DB row), while others will be present in all sources (such as unique ID, title, text content). Use copy-fields to consolidate fields as needed.
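As a sketch of that consolidation in schema.xml (the field names and the text_general type here are assumptions for illustration, not taken from the question):
<field name="title"     type="text_general" indexed="true" stored="true"/>
<field name="author"    type="text_general" indexed="true" stored="true"/>
<field name="db_column" type="text_general" indexed="true" stored="true"/>
<!-- catch-all field that every source copies into -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title"     dest="text"/>
<copyField source="author"    dest="text"/>
<copyField source="db_column" dest="text"/>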
Meta-data will vary greatly from project to project, but yes -- things like update date, filename, and any structured data you can parse out of the text files will surely help you improve relevance. Beyond that, it varies a lot from case to case. Maybe the file paths hint at a (possibly informal) taxonomy you can use as metadata. Maybe filenames contain metadata themselves (such as year, keyword, product names, etc.).
Be prepared to use different fields for different sources when displaying results. A source field goes a long way in terms of creating result tiles -- and it might turn out to be your most used facet.
An alternative (and probably preferred) approach to using copy-fields extensively is to use the DisMax/eDisMax request handlers, which make it easy to search across several fields.
Consider using a mix of copy-fields and (e)dismax. For instance, copy all fields into a catch-all text field that need not be stored, and include it in searches with a low boost value, while also including highly weighted fields (such as title, headings, keywords, or filename) in the search. There are a lot of parameters to tweak in dismax, but it's definitely worth the effort.
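As a rough sketch of that weighting, the query parameters could look like this (the field names and boost values are only an illustration, not taken from the question):
defType=edismax
q=annual report
qf=title^10 keywords^5 filename^3 text^0.5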
I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search those fields (e.g. title and content), but also the parent fields (user > name and category > name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside, though, is that when e.g. a user updates their name, I have to run through all of that user's blog posts and update those documents in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent that I need to search on as well.
Do you have any recommendations on how to handle this use case? Maybe my Google foo is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The most performant and easiest solution by far is to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than the documents are updated. And even if one of the values that is identical across a large set of documents changes, reindexing starting from the most recent documents (for a blog) and then working backwards will appear quite performant to most users. This assumes that you actually have to search on the values and don't just need the values themselves - those you could look up from secondary storage when displaying an item (and just store the never-changing id in the document).
Another option is to divide this into a multi-search problem: one collection for blog posts, one collection for users and one collection for categories. You then search each of the collections for the relevant data and merge the results in your search model. You can also use Streaming Expressions to hand most of this processing off to the Solr cluster for you.
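A rough sketch of such a merge as a Streaming Expression - the collection and field names are invented for illustration, both streams must be sorted on the join key, and the /export handler requires docValues on the exported fields:
innerJoin(
  search(blogs, q="title:solr", fl="id,title,user_id", sort="user_id asc", qt="/export"),
  search(users, q="*:*", fl="user_id,name", sort="user_id asc", qt="/export"),
  on="user_id"
)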
The reason why I always recommend flattening when possible is that most features in Solr (and Lucene) are written for a flat document structure, and flattening lets you fully leverage the features available. Since Lucene is by design a flat document store, most other features require special care to work with block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if it is possible at all). If the documents are flat, it just works.
I have documents in Solr which consist of fields whose values come from different source systems. The reason I am doing this is that this document is what I want returned from the Solr search, including functionality like hit highlighting. As far as I know, if I use a join across multiple Solr documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field name so that the first part of the name is an 8-character hash representing the source system. That way the systems can share common field names after the unique source hash, and I can easily clear out all fields that start with the source prefix if needed.
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to involve a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to keep up when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can stay specific to the indexing process. Usually the subsystems will be performant enough (at least if you do batch retrieval based on the documents that are being updated).
I am uploading many CSV files.
currency.csv file:
code,currency_name,currency_decimals
AUD,Australian Dollar,2
GBP,Pound Sterling,2
...
...
currency_holidays.csv file:
code,holiday_date,holiday_name
AUD,02/01/2012,New Year's Day Observed
AUD,26/01/2012,Australia Day
...
...
NOTE: uniqueKey is set to 'code' in solr configuration file
If I upload these files into a single Solr core, it will overwrite the matching currency records, e.g. AUD. Right?
Is it better to have one core per file, i.e. multiple cores?
This is my previous post:
apache solr csv file same values
What is the best solution? I need help. Hope someone can help out.
Thanks
GM
Some points you might want to think about:
If you have completely different entities with nothing in common and no dependencies between them (no joins), it would be better to have them as separate cores.
This would be a much cleaner approach, because:
there might be fields which share a name but need to be analyzed in different ways,
search may need to behave differently for each entity's fields and their boosts,
and separate cores are also easier to manage if the data is huge.
However, if you have a very small dataset and none of the above concerns you, just go with a single core.
For unique keys, you can prefix the ids with the entity type, e.g. currency_aud and holiday_aud, which will help you keep the entities separate and prevent overwriting.
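As a sketch of what that could look like - this assumes you add a dedicated id column and point uniqueKey at it, and note that the holiday keys also need the date to stay unique:
id,code,currency_name,currency_decimals
currency_aud,AUD,Australian Dollar,2

id,code,holiday_date,holiday_name
holiday_aud_2012-01-02,AUD,02/01/2012,New Year's Day Observed
holiday_aud_2012-01-26,AUD,26/01/2012,Australia Day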
This is more of a question for experienced people who've worked a lot with multilingual websites and e-shops. This is NOT a database structure question or anything like that. This is a question about how to store a multilingual website: NOT how to store translations. A multilingual website can not only be translated into multiple languages, but can also have language-specific content. For instance, the English version of the website can have a completely different structure than the same website in Russian or any other language. I've thought up two storage schemas for such cases:
// NUMBER ONE
table contents // to store some HYPOTHETICAL content
id // content id
table contents_loc // to translate the content
content, // ID of content to translate
lang, // language to translate to
value, // translated content
online // availability flag, VERY IMPORTANT
ADVANTAGES:
- Content can be stored in multiple languages. This schema is pretty common, except maybe for the "online" flag in the "_loc" tables. About that below.
- Every piece of content can not only be translated into multiple languages, but you could also mark online=false for a single language and stop the content from appearing in that language. Alternatively, that record could be removed from the "_loc" table to achieve the same effect as online=false, but that would be permanent and couldn't be easily undone. For instance, we could create some sort of menu, but we don't want one or more items to appear in English - so we set online=false on those "translations".
DISADVANTAGES:
- Quickly gets pretty ugly with more complex table relations.
- More difficult queries.
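For reference, a minimal SQL sketch of this first schema (the column types are my own assumptions):
CREATE TABLE contents (
    id INTEGER PRIMARY KEY                               -- content id
);
CREATE TABLE contents_loc (
    content INTEGER NOT NULL REFERENCES contents(id),    -- content to translate
    lang    CHAR(2) NOT NULL,                            -- language to translate to
    value   TEXT    NOT NULL,                            -- translated content
    online  BOOLEAN NOT NULL DEFAULT TRUE,               -- availability flag
    PRIMARY KEY (content, lang)
);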
// NUMBER TWO
table contents // to store some HYPOTHETICAL content
id, // content id
online // content availability (not the same as in first example)
lang, // language of the content
value, // translated content
ADVANTAGES:
1. Less painful to implement
2. Shorter queries
DISADVANTAGES:
1. Every multilingual record would now have 3 different IDs. This would be bad for e.g. products in an e-shop, since the first version lets us store different languages under the same ID, whereas this one would require 3 separate records to represent the same product.
The first storage option seems like a great solution, since you could easily use it in place of the second one as well, but not the other way around.
The only problem is that the first structure seems a bit like overkill (except in cases like product storage).
So my question to you is:
Is it logical to implement the first storage option? In your experience, would anyone ever need such a solution?
The question we ask ourselves is always:
Is the content the same for multiple languages and do they need a relation?
Translatable models
If the answer is yes, you need a translatable model: a model with multiple versions of the same record. That means you need a language flag on each record.
PROS: It gives you a structure in which you can see for example which content has not yet been translated.
Separate records per language
But many times we see a different solution as the better one: just separate the languages completely. We mostly see this in CMS solutions. The story is not only translated but also different. For example, in country 1 they have a different menu structure, other news items, other products and other pages.
PROS: Total flexibility and no unexpected records from other languages.
Example
We see it like writing a magazine: you can write one and then translate it to another language. Yes, that's possible, but in the real world we see more and more that the content is structurally different. People don't like to be surprised, so you need lots of steps to make sure content is not visible in the wrong languages, pages don't get created in duplicate, etc.
Sharing logic
So what we do most of the time is: share the views, and make the buttons, inputs, etc. translatable, but keep the content separated, so that every admin can just work in their own area. If we need to confirm that some records are available in all languages, we can always handle that by creating a (properly relational) link between them, but it is not the standard we use most of the time.
Really translatable records like products
Because we are flexible in creating models, we can decide how to work with them based on the requirements. I would not try to look for a general solution that works for everything, because there is none. You need a solution based on your data.
Assuming that you need a translatable model, as described by Luc, I would suggest coming up with some sort of special-character-delimited key-value pair format for the value column of the content table. Example:
#en=English Term#de=German Term
You may use UDFs (User Defined Functions in T-SQL) to set/get the appropriate term based on the specified language.
For selecting:
select id, dbo.GetContentInLang(value, @lang)
from content
For updating:
update content
set value = dbo.SetContentInLang(value, @lang, @new_content)
where id = @id
The UDFs:
a. do have a performance hit, but this is also the case for the join that you would have to do between the content and content_loc tables,
and
b. are somewhat difficult to implement, but are reusable practically throughout your database.
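For illustration, a rough, untested T-SQL sketch of what the selecting UDF could look like (it assumes the '#xx=' markers never appear inside a translated term):
CREATE FUNCTION dbo.GetContentInLang (@value NVARCHAR(MAX), @lang NVARCHAR(5))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    -- locate the '#<lang>=' marker for the requested language
    DECLARE @marker NVARCHAR(10) = '#' + @lang + '=';
    DECLARE @start INT = CHARINDEX(@marker, @value);
    IF @start = 0 RETURN NULL;                          -- language not present
    SET @start = @start + LEN(@marker);
    -- the term runs until the next '#' marker, or the end of the string
    DECLARE @end INT = CHARINDEX('#', @value, @start);
    IF @end = 0 SET @end = LEN(@value) + 1;
    RETURN SUBSTRING(@value, @start, @end - @start);
END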
You can also do the above on the application/UI layer.
Given a collection of files which will have associated metadata, what are the recommended methods for storing this metadata?
Some file formats support storing metadata internally (EXIF, ID3, etc.), but not all file formats support this, so what are the more general options?
Some of the metadata would almost certainly be unique (titles/descriptions/etc), whilst some would be repetitive to varying degrees (categories/tags/etc).
It may also be useful to group the metadata, if different types of attribute are required.
Ideally, solutions should cover concepts, rather than specific language implementations.
Storing metadata in a database has some advantages, but the main problem with a database is that the metadata is not directly connected to your data. It is more robust if the metadata stays with the data - like a special file in the same directory or something like that.
Some filesystems offer special functionality that can be used for metadata, like NTFS alternate data streams. Unfortunately, this can be used for metadata storage in special cases only, because those streams are easily lost when copying the data to a storage system that does not support them. I believe Linux filesystems also have a similar mechanism (extended attributes).
Anyway, the most common solutions are:
a separate hidden file (or files, per directory) that holds the metadata
a special hidden directory with metadata (as Subversion, CVS, etc. use)
a database (of various kinds) for all application-specific metadata - in most cases this database can also be used for caching purposes
IMO there is no general-purpose solution. I would choose storing metadata in a hidden file (for robustness) combined with a database for fast access and caching.
One option might be a relational database, structured like this:
FILE
f_id
f_location
f_title
f_description
ATTRIBUTE
a_id
a_label
VALUE
v_id
v_label
METADATA
md_file
md_attribute
md_value
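A minimal SQL sketch of that structure - the column types and the sample query are assumptions for illustration:
CREATE TABLE file (
    f_id          INTEGER PRIMARY KEY,
    f_location    TEXT NOT NULL,
    f_title       TEXT,
    f_description TEXT
);
CREATE TABLE attribute (
    a_id    INTEGER PRIMARY KEY,
    a_label TEXT NOT NULL                -- e.g. 'tag', 'category'
);
CREATE TABLE value (
    v_id    INTEGER PRIMARY KEY,
    v_label TEXT NOT NULL                -- e.g. 'cars', 'holiday'
);
CREATE TABLE metadata (
    md_file      INTEGER REFERENCES file(f_id),
    md_attribute INTEGER REFERENCES attribute(a_id),
    md_value     INTEGER REFERENCES value(v_id),
    PRIMARY KEY (md_file, md_attribute, md_value)
);
-- example: find every file tagged 'cars'
SELECT f.f_location
FROM file f
JOIN metadata m  ON m.md_file = f.f_id
JOIN attribute a ON a.a_id = m.md_attribute
JOIN value v     ON v.v_id = m.md_value
WHERE a.a_label = 'tag' AND v.v_label = 'cars';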
This implementation has some unique information (title/description), but is primarily targeted at repetitive groups of data.
For some requirements, other less generic tables may be more useful.
The advantage of this is that relational databases are very common, and obviously very good at handling relationships and storing lots of data.
However, for some uses a database server brings an overhead which might not be desirable.
Also, the database server is distinct from the files - they do not sit together, and require different methods of interaction.
Databases do not (easily) sit under version control - which may be a good or bad thing, depending on your point of view and specific needs.
I think the "solution" depends greatly upon what you're going to be doing with the metadata.
For example, almost all of the metadata we store (multiple datasets of scientific data) is chopped up and stored in a database. This allows us to create datasets that preserve the common metadata between the files (as you say, categories and tags), while keeping file-specific structures (title, start/stop time, min/max values, etc.). While we could keep these in hidden files, we do a lot of searching and open our interface to outside consumers via web services.
If you're storing metadata that isn't going to be searched on, hidden files or a dedicated .xml file per "real" file isn't a bad route to take. It's readable by basically anything, can be converted to different formats easily, and won't be lost if you decide to change your storage mechanism.
Metadata should help you, not hinder you. I've seen (and been a part of) systems where metadata storage became more burdensome than storing the actual data, and turned into a liability. Just keep in mind what you are trying to do with it, and don't overextend yourself with "what ifs".
Plain text has some obvious advantages over anything else. Something like
FileName = 'ferrari.gif'
Title = 'My brand new car'
Tags = 'cars', 'cool'
Related = 'michaelknight.mp3'
Picasa's Picasa.ini files are a good example of this kind of metadata. Also, instead of inventing your own format, XML might be worth considering. There are plenty of readily available DOM processors to deal with this format.
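For instance, a per-file sidecar XML file could look roughly like this (the element names are invented for illustration):
<metadata>
  <filename>ferrari.gif</filename>
  <title>My brand new car</title>
  <tags>
    <tag>cars</tag>
    <tag>cool</tag>
  </tags>
  <related>michaelknight.mp3</related>
</metadata>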
Then again, if the number of files and relations between them is huge, databases may be better.
I would basically make a metadata DB which held this information:
RESOURCE_TABLE
RESOURCE_ID
RESOURCE_TYPE (folder, doctype, web link, other)
RESOURCE_URL (any URL)
NOTES_TABLE
NOTE_ID
RESOURCE_NO
RESOURCE_NOTE (long text)
TAGS_TABLE
TAG_ID
RESOURCE_NO
TAG_TEXT
Then I would use the note field for textual notes on the file/folder/resource. Choose whether you want 1:1 or 1:N for this.
I would use the tags field to store any number of searchable parameters like YEAR, PROJECT, and other values that describe and group your content.
Then you could add tables for owner, stakeholders, and other organisational info, etc.
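As a sketch, assuming tags are stored as 'NAME=value' strings in TAG_TEXT (just one possible convention), a tag search against the schema above could then look like:
-- find every resource tagged with a given year
SELECT r.RESOURCE_URL
FROM RESOURCE_TABLE r
JOIN TAGS_TABLE t ON t.RESOURCE_NO = r.RESOURCE_ID
WHERE t.TAG_TEXT = 'YEAR=2012';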