I am uploading many CSV files.
currency.csv file:
code,currency_name,currency_decimals
AUD,Australian Dollar,2
GBP,Pound Sterling,2
...
...
currency_holidays.csv file:
code,holiday_date,holiday_name
AUD,02/01/2012,New Year's Day Observed
AUD,26/01/2012,Australia Day
...
...
NOTE: uniqueKey is set to 'code' in the Solr schema file.
If I upload these files into a single Solr core, it will overwrite the matching currency records, e.g. AUD. Right?
Is it better to have one core per file, i.e. multiple cores?
This is my previous post:
apache solr csv file same values
What is the best solution? I need help. Hope someone can help out.
Thanks
GM
Some points you might want to think about:
If you have completely different entities with nothing in common and no dependencies between them (no joins), it would be better to keep them as separate cores.
This is a much cleaner approach, as there might be fields that share a name but need to be analyzed in different ways, and search may need to behave differently for those fields and their boosts.
Separate cores are also easier to manage if the data is huge.
However, if you have a very small dataset and none of the above concerns you, just go with a single core.
For unique keys, you can prefix the ids with the entity type, e.g. currency_aud and holiday_aud, which will help you keep the entities separate and prevent overwriting.
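For example, here is a minimal SolrJ sketch of that idea. It assumes you switch the uniqueKey to a dedicated string "id" field and index into a single core called "finance" (both assumptions on my part); the holiday id also includes the date, since one currency can have many holidays.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PrefixedIdIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical single core holding both entity types.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/finance").build();

        // Row from currency.csv: AUD,Australian Dollar,2
        SolrInputDocument currency = new SolrInputDocument();
        currency.addField("id", "currency_aud");            // type prefix prevents overwriting
        currency.addField("code", "AUD");
        currency.addField("currency_name", "Australian Dollar");
        currency.addField("currency_decimals", 2);

        // Row from currency_holidays.csv: AUD,26/01/2012,Australia Day
        SolrInputDocument holiday = new SolrInputDocument();
        holiday.addField("id", "holiday_aud_20120126");      // the date keeps multiple holidays per code unique
        holiday.addField("code", "AUD");
        holiday.addField("holiday_date", "2012-01-26T00:00:00Z");
        holiday.addField("holiday_name", "Australia Day");

        solr.add(currency);
        solr.add(holiday);
        solr.commit();
        solr.close();
    }
}
```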
I have a Solr (or rather Heliosearch 0.07) core on a single EC2 instance. It contains about 20M documents and takes about 50GB on disk. The core is quite fixed/frozen and performs quite well once everything is warmed up.
The problem is a multi-valued string field: that field contains assigned categories, which change quite frequently for large parts of the 20M documents. After a commit, the warm-up takes way too long to be usable in production.
The field is used only for faceting and filtering. My idea was to store the categories outside Solr and inject them somehow using custom code. I checked quite a few approaches in various JIRA issues and blogs, but I could not find a working solution. Item 2 of this issue suggests that there is a solution, but I don't understand what the author is talking about.
I would appreciate any solution which allows me to update my category field without having to re-warm my caches afterwards.
I'm not sure that JIRA issue will help you: it covers an advanced topic and, most importantly, it is still unresolved, so the feature is not yet available.
Partial document updates are not useful here because a) they require that the fields in your schema are stored, and b) behind the scenes they reindex the whole document again anyway.
From what you say it seems you have one monolithic index: have you considered splitting the index using sharding or SolrCloud? That way each "portion" would be smaller and the autowarming shouldn't be a big problem.
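As a rough illustration of the SolrCloud route (the collection name, configset, and shard count below are purely hypothetical), a sharded collection can be created through the Collections API; each shard then carries a smaller portion of the index, so its caches are cheaper to autowarm after a commit.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateShardedCollection {
    public static void main(String[] args) throws Exception {
        // Split the documents across 4 shards instead of one monolithic core.
        String url = "http://localhost:8983/solr/admin/collections"
                   + "?action=CREATE&name=products&numShards=4&replicationFactor=1"
                   + "&collection.configName=products_conf";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        System.out.println("Collections API returned HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```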
I have documents in Solr which consist of fields whose values come from different source systems. The reason I am doing this is that this document is what I want returned from the Solr search, including functionality like hit highlighting. As far as I know, if I use a join across multiple Solr documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field name with the first part of the field name being an 8-character hash representing the source system; this way the sources can have common field names outside of the unique source hash. It also means I can easily clear out all fields that start with a given source prefix, if needed.
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to pull in a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to perform well when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be kept specific to the indexing process. Usually the subsystems will be performant enough (at least when doing batch retrieval based on the documents that are being updated).
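A minimal sketch of that flow with SolrJ (the core URL, the _a_s/_b_s dynamic-field suffixes, and the fetchFromSystemA/fetchFromSystemB helpers are all hypothetical): rebuild the complete document from both sources whenever either one changes, and let the add replace the previous version wholesale.

```java
import java.util.Collections;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FullReindexer {
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/entities").build();

    public void reindex(String entityId) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", entityId);

        // Pull fresh data from both subsystems at reindex time.
        Map<String, Object> fromA = fetchFromSystemA(entityId);   // hypothetical helper
        Map<String, Object> fromB = fetchFromSystemB(entityId);   // hypothetical helper
        fromA.forEach((field, value) -> doc.addField(field + "_a_s", value));
        fromB.forEach((field, value) -> doc.addField(field + "_b_s", value));

        solr.add(doc);   // replaces the whole document for this id
        solr.commit();
    }

    private Map<String, Object> fetchFromSystemA(String id) { /* call system A here */ return Collections.emptyMap(); }
    private Map<String, Object> fetchFromSystemB(String id) { /* call system B here */ return Collections.emptyMap(); }
}
```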
I have a huge number of PDF/Word/Excel/etc. files to index (40GB now, but maybe up to 1000GB in some months) and I was considering using Solr, with a DataImportHandler and Tika. I have read a lot of topics on this subject, but there is one problem for which I still have not found a solution: if I index all the files (full or delta import), remove a file from the filesystem, and index again (with a delta import), then the document corresponding to the file will not be removed from the index.
Here are some possibilities:
Do a full import. But I want to avoid this as much as possible since I think it could be very time-consuming (several days, though that is not very important) and bandwidth-consuming (the main issue, since the files are on a shared network drive).
Implement a script which would verify, for each document in the index, whether the corresponding file exists (much less bandwidth-consuming); see the sketch after the details below. But I do not know whether I should do this inside or outside of Solr, and how.
Do you have any other idea, or a way to implement the second solution? Thanks in advance.
Some details :
I will use the "newerThan" option of the FileListEntityProcessor to do the delta import.
If I store the date when each document was indexed, it does not help me, because if I haven't indexed a document in the last import it can be because it has been removed OR because it has not changed (delta import).
I have both stored and unstored fields, so I don't think the new Solr 4.0 possibility of updating only one field in a document can be a solution.
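For what it is worth, here is a hedged SolrJ sketch of the second possibility; it assumes each Solr document stores the absolute path in a "file_path" field, which is a name I made up. It pages through the index, checks each path on the filesystem, and deletes the documents whose files are gone.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class OrphanCleaner {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build();
        List<String> orphans = new ArrayList<>();

        int start = 0;
        final int rows = 1000;
        while (true) {
            SolrQuery q = new SolrQuery("*:*");
            q.setFields("id", "file_path");
            q.setStart(start);
            q.setRows(rows);
            SolrDocumentList results = solr.query(q).getResults();
            for (SolrDocument doc : results) {
                String path = (String) doc.getFieldValue("file_path");
                if (path != null && !new File(path).exists()) {
                    orphans.add((String) doc.getFieldValue("id"));   // file is gone, mark for deletion
                }
            }
            if (results.size() < rows) {
                break;   // last page reached
            }
            start += rows;
        }

        if (!orphans.isEmpty()) {
            solr.deleteById(orphans);   // remove index entries whose files no longer exist
            solr.commit();
        }
        solr.close();
    }
}
```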
Have you thought about using a file system monitor to catch deletions and update the index?
I think Apache Commons IO supports that.
Check out the org.apache.commons.io.monitor package and its FileAlterationObserver and FileAlterationMonitor classes.
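Something along these lines, perhaps (a sketch only; the watched directory, polling interval, and the "file_path" field name are assumptions): Commons IO polls the shared drive and, when it reports a deletion, the listener removes the matching document via SolrJ.

```java
import java.io.File;
import org.apache.commons.io.monitor.FileAlterationListenerAdaptor;
import org.apache.commons.io.monitor.FileAlterationMonitor;
import org.apache.commons.io.monitor.FileAlterationObserver;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeletionWatcher {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build();

        FileAlterationObserver observer = new FileAlterationObserver(new File("/mnt/shared/docs"));
        observer.addListener(new FileAlterationListenerAdaptor() {
            @Override
            public void onFileDelete(File file) {
                try {
                    // Drop the index entry for the file that just disappeared.
                    solr.deleteByQuery("file_path:\"" + file.getAbsolutePath() + "\"");
                    solr.commit();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });

        // Poll the directory every 60 seconds.
        FileAlterationMonitor monitor = new FileAlterationMonitor(60_000, observer);
        monitor.start();
    }
}
```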
I have set up Solr with multiple cores. Each core has its own schema (with a common unique id).
Core0: Id, name
Core1: Id, type.
I am looking for a way to generate the resultset as Id, name, type. Is there any way?
I tried the solution suggested here but it did not work.
Search multiple SOLR core's and return one result set
The reason for not having a merged schema is as follows:
We have thousands of documents (PDF files) which need to be extracted and inserted into Solr for search on a daily basis, which means there will be millions of documents at some point.
The second requirement is that at any point in time, there can be a request to perform some processing on all historical PDF files and add that processed information into Solr for each file, so that the processed info comes up in search.
Now, since extraction takes a long time, it would be very difficult (long update time) to perform a historical update for all files. So I thought that if I could use another core to keep the processed info, it would be quicker. Please suggest if there is an alternative solution.
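In case it helps frame the problem, here is a hedged SolrJ sketch of one client-side workaround (core URLs, the row limit, and the query are placeholders): query both cores separately and merge the rows on the shared Id in application code, since a single query cannot span two independent cores.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class TwoCoreMerge {
    public static void main(String[] args) throws Exception {
        SolrClient core0 = new HttpSolrClient.Builder("http://localhost:8983/solr/core0").build();
        SolrClient core1 = new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build();

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(100);   // placeholder page size

        // Id -> {Id, name, type}
        Map<String, Map<String, Object>> merged = new HashMap<>();

        for (SolrDocument doc : core0.query(query).getResults()) {
            String id = (String) doc.getFieldValue("Id");
            Map<String, Object> row = merged.computeIfAbsent(id, k -> new HashMap<>());
            row.put("Id", id);
            row.put("name", doc.getFieldValue("name"));
        }
        for (SolrDocument doc : core1.query(query).getResults()) {
            String id = (String) doc.getFieldValue("Id");
            merged.computeIfAbsent(id, k -> new HashMap<>()).put("type", doc.getFieldValue("type"));
        }

        merged.values().forEach(System.out::println);
    }
}
```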
How do I index text files, web sites, and a database in the same Solr schema? All three sources are a requirement and I'm trying to figure out how to do it. I did some examples and they work fine while they are separate from each other; now I need them all in one schema, since the user will be searching across all three data sources.
How should I proceed?
You should sketch up a few notes for each of your content sources:
What meta-data is available
How is the information accessed
How do I want to present the information
Once that is done, determine which meta-data you want to make searchable. Some of it might be very specific to just one of the content sources (such as author on web pages, or any given field in a DB row), while others will be present in all sources (such as unique ID, title, text content). Use copy-fields to consolidate fields as needed.
Meta-data will vary greatly from project to project, but yes -- things like update date, filename, and any structured data you can parse out of the text files will surely help you improve relevance. Beyond that, it varies a lot from case to case. Maybe the file paths hint at a (possibly informal) taxonomy you can use as metadata. Maybe filenames contain metadata themselves (such as year, keyword, product names, etc).
Be prepared to use different fields for different sources when displaying results. A source field goes a long way in terms of creating result tiles -- and it might turn out to be your most used facet.
An alternative (and probably preferred) approach to using copy-fields extensively is to use the DisMax/eDisMax request handlers to facilitate searching across several fields.
Consider using a mix of copy-fields and (e)dismax. For instance, copy all fields into a catch-all text field that need not be stored, and include it in searches with a low boost value alongside highly weighted fields (such as title, headings, keywords, or filename). There are a lot of parameters to tweak in dismax, but it's definitely worth the effort.
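As a small illustration (the core name, field names, and boosts are assumptions, not a recommendation for your schema), an eDisMax query via SolrJ searching a few weighted fields plus a low-boosted catch-all field might look like this:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class EdismaxSearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/content").build();

        SolrQuery query = new SolrQuery("annual report 2012");
        query.set("defType", "edismax");
        // Weighted fields first, then the unstored catch-all "text" field with a low boost.
        query.set("qf", "title^10 keywords^5 filename^3 text^0.2");
        query.set("fl", "id,title,source");   // a "source" field helps when rendering mixed results

        System.out.println(solr.query(query).getResults().getNumFound() + " hits");
        solr.close();
    }
}
```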