We have a system that lets users create applications and store data in them. We want to separate the index of each application, so we create a core for each application and search only that application's core when a user makes a query. Since there is no relation between the applications, this could perform better than storing all the indexes together.
I have two questions related to this.
Is this a good solution? If not, could you please suggest a better one?
Is there a limit on the number of cores that I can create in Solr? There will be thousands of applications on the system, maybe more.
Yes, it COULD be a good solution; as always, it depends on the specific use case.
Look at this Jira issue, where Erick mentions a 10k-core system. So it seems it could work for you, but you would need to assess the hardware, etc.
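To illustrate the core-per-application idea, here is a minimal sketch of how a query could be routed to one application's core. The host, port, and the `app_<id>` naming scheme are assumptions made for the example, not something from the question; only the `/solr/<core>/select` URL shape and the `q`, `rows`, and `wt` parameters are standard Solr.

```python
from urllib.parse import quote, urlencode

# Assumed base URL of the Solr server (hypothetical host and port).
SOLR_BASE = "http://localhost:8983/solr"


def core_name(app_id: int) -> str:
    """Hypothetical naming scheme: one core per application."""
    return f"app_{app_id}"


def select_url(app_id: int, query: str, rows: int = 10) -> str:
    """Build the /select URL for the core that belongs to one application."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{quote(core_name(app_id))}/select?{params}"


print(select_url(42, "title:report"))
```

Because the core name is derived from the application id, a query for one application can never touch another application's index.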
I have a Spring Boot/React application. My database already holds a list of users, populated from LDAP.
As part of a form, I need to allow users to specify a list of users. Since they could be searching through (and technically selecting from) up to 400,000 users (most cases will be in the 10k-or-less range), I assume I'd need to do this on both the client and the server side.
Does anyone have any recommendations on the approach or technologies?
I'm not using a small amount of data, but I don't want to over-engineer it either (tips are mostly for server-side, but any are welcome).
If you are using Hibernate as the ORM in your application, you may also check out Hibernate Search. It seems to fit your purpose, since searching through a list of users can be done with an ordinary text-based index. Hibernate Search builds on Lucene, which is well suited to text-based indexing and searching.
The other answer is good and works perfectly fine when you have a small set of data, but be aware of a few design issues with it.
Lucene is not distributed, and you can't easily scale it across multiple machines without duplicating the whole index. That is perfectly fine when you have a small set of data, and in fact it's pretty fast because there is no network call (with Elasticsearch there is one).
If you want to build a stateless application that is easy to scale horizontally, Lucene will not help: it is stateful, and every newly spawned app server has to finish building its local Lucene index before it can serve searches.
Elasticsearch (ES) is REST-based, is written in Java, and has a very good Java client that you can easily use for simple to complex use cases.
Last but not least, please go through the Stack Overflow answer by none other than Shay Banon, the creator of Elasticsearch, who explains why he created ES in the first place :). It will give you more trade-offs and insights for choosing the best solution for your use case.
I am new to Solr, and am trying to figure out the best way to index and search our catalogs.
We have to index multiple manufacturers, and each manufacturer has a different catalog per country. Each catalog for each manufacturer per country is about 8GB of data.
I was thinking it might be easier to have an index per manufacturer per country and have some way to tell Solr in the URL which index to search.
Is that the best way of doing this? If so, how would I do it? Where should I start looking? If not, what would be the best way?
I am using Solr 3.5
In general there are two ways of solving this:
Split each catalog into its own core, running a large multicore setup. This will keep each index physically separated from the others, and will allow you to use different properties (language, etc.) and configuration for each core. This might be practical, but will require quite a bit of overhead if you plan on searching through all the cores at the same time. It'll be easy to move the different cores to different servers later: simply spin the cores up on a different server.
Run everything in a single core - if all the attributes and properties of the different catalogs are the same, add two fields - one containing the manufacturer and one containing the country. Filter on these values when you need to limit the hits to a particular country or manufacturer. It'll allow you to easily search the complete index, and scalability can be implemented by replication or something like SolrCloud (coming in 4.0). If you need multilanguage support you'll have to have a field for each language with the settings you need for that language (such as stemming).
There are a few tidbits of information about this on the Solr wiki, but my suggestion is to simply try one of the methods and see if that solves your issue. Moving to the other solution shouldn't be too much work. The simplest implementation is to keep everything in the same index.
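As a rough sketch of the single-core option, the request could carry one filter query (`fq`) per restriction. The field names `manufacturer` and `country` are assumptions that would have to match the schema; `q` and `fq` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def quote_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def catalog_query(text, manufacturer=None, country=None) -> str:
    """Build the query string for a search limited to one catalog.

    The field names 'manufacturer' and 'country' are hypothetical and
    must exist in the schema of the shared core.
    """
    params = [("q", text)]
    if manufacturer is not None:
        params.append(("fq", f"manufacturer:{quote_term(manufacturer)}"))
    if country is not None:
        params.append(("fq", f"country:{quote_term(country)}"))
    return urlencode(params)


print(catalog_query("brake pad", manufacturer="acme", country="DE"))
```

Leaving both filters out searches the complete index, which is the main convenience of keeping everything in one core.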
I need to implement master/slave/load balancing into an existing site.
Does anyone use these (or other) implementations for master/slave switching?
The resources I found on how to implement master/slave in Cake:
(preferable) gamephase.net/posts/view/master-slave-datasource-behavior-cakephp
http://bakery.cakephp.org/articles/view/master-slave-support-also-with-multiple-slave-support
http://bakery.cakephp.org/articles/view/load-balancing-and-mysql-master-and-slaves-2
I'm getting number 1) to work most of the time, but it has trouble with some of the joins.
I welcome new sources, hacks or mods for master/slave implementation as for now I can't get my head around it.
(Cake version I am using atm is 1.2)
(I'm cross posting this on CakePHP's google groups http://groups.google.co.uk/group/cake-php/browse_thread/thread/4b77af429759e08f)
Take a look at this tutorial in regards to Master/Slave over several nodes.
http://www.howtoforge.com/setting-up-master-master-replication-on-four-nodes-with-mysql-5-on-debian-etch
This may help you understand better.
As far as I can tell this happens if your model has relationships with models that do not use the same behaviour. Please correct me if this assumption is wrong.
All models have metadata, which CakePHP gathers using a DESCRIBE query on the database. If this data is not present, your joins will be broken. This metadata is specific to the database config.
CakePHP uses this metadata to populate the $this->_schema property. SQL joins are built with data from the $this->_schema property, and I guess this is where your issue lies: the database config introduced by this master/slave switching behaviour does not have any model metadata for the tables associated with the model.
A solution would be to update your behaviour so that it only switches selectively on reads and writes, and to add this behaviour to all related models, i.e. any model that is related using hasOne, hasMany, etc. should also use the same behaviour.
In essence all models that are related should write to the same database and read from the same database.
The bonus of this solution is you will share the same database connections.
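The routing idea can be sketched independently of CakePHP. The following is a language-neutral illustration in Python, not CakePHP 1.2 behaviour code; the class name, method names, and connection labels are all hypothetical. The point is that every related model asks the same router, so they all resolve reads and writes consistently.

```python
import itertools


class ConnectionRouter:
    """Pick a connection per operation: writes go to the master,
    reads rotate over the replicas (hypothetical sketch)."""

    def __init__(self, master, replicas):
        self.master = master
        # Round-robin over replicas; None means "no replicas configured".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def for_write(self):
        return self.master

    def for_read(self, in_transaction=False):
        # Inside a transaction, read from the master so the reads see
        # the writes made earlier in the same transaction.
        if in_transaction or self._replicas is None:
            return self.master
        return next(self._replicas)


# One router instance shared by all related models.
router = ConnectionRouter("master", ["replica1", "replica2"])
```

A join is a single query, so it runs entirely on whichever connection `for_read()` returns; the schema metadata therefore has to be available for that connection for every model involved.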
Your web app seems to be multi-tier; you need to scale each tier individually:
The web layer, i.e. the CakePHP app, can be spread across multiple web servers. This is easy to do, as the code itself is idempotent. You should look into how to load balance Apache servers; it is not a big deal. Web servers have quite high throughput though, so if you have a bottleneck here, you might improve your code/caching strategy instead. (Use memcache instead of file caches, for example.) If you depend on the file system (uploads, for example) this becomes a bit more complex, as it must become distributed or separated.
The data layer. There are various tutorials how to scale/load balance MySQL already linked by others.
First, though, I would suggest running benchmarks. (Premature optimization is the root of all evil.) You must first know where the bottlenecks are and where the throughput should scale. Often you can optimize queries or caching, or make things cacheable in the first place. You must also be clear about your goals: scalability? fault tolerance?
I am in the process of designing the CMS that I am about to create. I was thinking about the database and how I want to approach it.
Do you think it's best to create one master database for all my clients' websites, or should I have one database per site?
What are the benefits and drawbacks of both approaches? I am always thinking about the future, so I was thinking about adding memcache or APC caching to the project as an option for my clients.
Just trying to learn the best practices and what other developers' approaches would be.
I've run both. My business chooses to separate client-specific data into separate tables so that if one happens to go corrupt, not all are taken down. In an ideal world this might never happen, but Murphy's law... It also seems very easy to find things with them separated. You will know with 100% certainty that one client's content will never show up on another's page.
If you do go down that route, be prepared to create scripts that build and configure databases for you. There's nothing fun about building a great system and having demand for it, only to spend your time manually setting up DBs and installs all day long. Also, setting DB names is one additional step that's not part of using a single DB table; it's a headache that will repeat itself over and over again.
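A provisioning script of that kind might look roughly like the sketch below. It uses SQLite files as a stand-in for whatever database server is actually in use, and the schema, the `cms_<client>` naming, and the function names are all made up for illustration.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Hypothetical shared schema, kept in one place and applied to every client.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL
);
"""


def provision_client_db(root: Path, client: str) -> Path:
    """Create (or update) the database for one client and return its path."""
    # Client names end up in file names, so accept only letters and digits.
    if not client.isalnum():
        raise ValueError(f"unsafe client name: {client!r}")
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"cms_{client}.sqlite3"
    with closing(sqlite3.connect(path)) as conn:
        conn.executescript(SCHEMA)
    return path


def provision_all(root: Path, clients) -> list:
    """Provision one database per client from the same schema."""
    return [provision_client_db(root, client) for client in clients]


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = provision_all(Path(tmp), ["acme", "globex"])
    print([p.name for p in created])
```

Because the schema uses `IF NOT EXISTS`, the same script can be re-run safely when a new client is added.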
Develop the single master DB. It will take a small amount of additional effort and add a little bit more complexity to the database design, but will give you a few nice features. The biggest is being able to share data between sites.
Designing for a master database means that you have the option to combine sites when it makes sense, but also lets you install a master per site. Best of both worlds.
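For comparison, a single shared database usually means every site-specific table carries a site identifier. Here is a minimal sketch using an in-memory SQLite database; the table and column names are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    site_id INTEGER NOT NULL REFERENCES sites(id),
    title TEXT NOT NULL
);
CREATE INDEX pages_by_site ON pages(site_id);
""")
conn.executemany(
    "INSERT INTO sites (id, name) VALUES (?, ?)",
    [(1, "alpha"), (2, "beta")],
)
conn.executemany(
    "INSERT INTO pages (site_id, title) VALUES (?, ?)",
    [(1, "Home"), (1, "About"), (2, "Home")],
)


def pages_for_site(conn, site_id):
    """Return the page titles of one site; every query filters on site_id."""
    rows = conn.execute(
        "SELECT title FROM pages WHERE site_id = ? ORDER BY id", (site_id,)
    )
    return [title for (title,) in rows]


print(pages_for_site(conn, 1))
```

Sharing data between sites then becomes a matter of relaxing the `site_id` filter, while installing one master per site simply means a `sites` table with a single row.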
It depends greatly upon the amount of customization each client will require. If you foresee clients asking for many one-off features specific to their deployment, separate databases based on a single core structure might make sense. I would highly recommend trying to make any customizations usable by all clients though, and keeping all structure defined in one place/database instead of duplicating it across multiple databases. By using one database, you make updating the structure straightforward and the implementation consistent across all sites, so they can all use the same CMS code.
I am considering using Solr in a multi-tenant application and I am wondering if there are any best practices or things I should watch out for?
One question in particular: would it make sense to have a Solr core per tenant? Are there any issues with having a large number of Solr cores?
I am considering using a core per tenant because I could secure each core separately.
Thanks
Solr Cores are an excellent idea for multitenant, particularly as they can be managed at runtime (so not requiring a server restart). You shouldn't run into too many problems with performance for having multiple Solr cores, but be aware the performance of one core will be impacted by the work on other cores - they're probably going to be sharing the same disk.
I can see why you might want to give direct API access, for example if each 'user' is a Drupal site or similar, for a shared-hosting type of environment. The best thing would be to secure the different URLs, e.g. if you had /solr/admin/cores, /solr/client1 for a client core, and /solr/client2 for another, you would have three different authentications: one for your admin, and one each for your tenants. This is done in the container (Jetty, Tomcat, etc.); take a look at the general Solr Security page: http://wiki.apache.org/solr/SolrSecurity. You'll want to set up a basic access login for each path in the same way.
You would no more use a Solr core for each tenant than you would use a separate database table for each tenant.
If you think of a core like a database table and organize your project in such a way that each core represents an object in your problem space, then you can better leverage Solr.
Where Solr shines is when you need to index text and then search it quickly. If you are not doing that, you might as well use a relational database.
Also, from your question about securing Solr for each tenant, I hope you're not suggesting allowing your logged-in users to access the Solr output directly? Your users should not be able to directly access your Solr instance.
Good luck.
That's OK, but you cannot properly use the built-in cache for your requirements. You could add a permission bit field and change the query component so that it filters results according to that permission. A bitwise operation is also available for this; make use of it for your needs.
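The bitwise check itself is simple. Below is a plain Python illustration of the idea, not Solr code: the permission names, the masks, and the document names are hypothetical, and in Solr the equivalent test would have to happen inside a custom query component or filter.

```python
from enum import IntFlag


class Permission(IntFlag):
    """Hypothetical permission bits stored with each document."""
    PUBLIC = 1
    STAFF = 2
    ADMIN = 4


def can_see(doc_mask: Permission, user_mask: Permission) -> bool:
    """A document is visible if it shares at least one bit with the user."""
    return (doc_mask & user_mask) != 0


# Hypothetical documents and the groups allowed to see them.
docs = {
    "faq": Permission.PUBLIC | Permission.STAFF,
    "payroll": Permission.ADMIN,
}
user = Permission.PUBLIC
visible = sorted(name for name, mask in docs.items() if can_see(mask, user))
print(visible)
```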