Database Sharding Support in Propel

Database Sharding Support in Propel - database

Just wonder how good is Propel's support for database sharding? I am thinking about creating my application in PHP, using MySQL as the database server and Propel as the ORM.
I figure out that it may be good to keep the architecture scalable right from the start, just in case my application takes off.
What's your take?

I think that's a very bad idea. Assuming that you need to shard your data is not a good assumption. You don't know, in advance, how you're going to want to scale. Sharding is a very complicated business and needs to be avoided if at all possible. This is an obscene case of premature optimisation.

I agree with MarkR that it's too early to be worrying about sharding, but I disagree that it should be avoided if at all possible. I'd say go with the ORM that seems to fit your style and language choice -- and Propel is probably the right one in your case. Even if your application takes off in a big way, sharding probably won't be necessary -- you can easily pull off 25 million records with a MySQL-based DBMS and some decent caching techniques, so just focus on making your queries fast and design for easy memcache-integration, and you'll be a happy camper even when your app takes off.
Good luck with it!

Propel supports sharding out of the box through connections. check an example here:
http://groups.google.com/group/propel-users/browse_thread/thread/4d19c0668aa17452

Some database sharding middleware can help you.
For example: Apache ShardingSphere (https://shardingsphere.apache.org/document/current/en/features/sharding/)
There are 2 adaptors of ShardingSphere, JDBC and Proxy. JDBC is the java driver to connect database which no fit for PHP, but Proxy is just like the database (MySQL or Postgres).
ShardingSphere-Proxy can isolate the data sharding and business logic. It is better to use third party middleware to handle common problems.

Related

Server setup to host a tool like Google Analytics?

I am developing a Analytics tool similar to Google Analytics. That will store keywords, visits and pages in a database.
So the database can grow very quickly because I want to have many people using it.
How should I setup the database? One database for all the accounts and all the websites being monitored? Or it would be better to have one database for every account?
Also, I am planning to start with one dedicated server but I'm sure that I will need more than one server in the future so I have to build it keeping that in mind.
I also know that if I do multiple databases for every account then I will have to run upgrade scripts on all of them when the schema of the app will change.

What kind of database do you plan to use ? There is a BIG difference between relational (PostgreSQL, MySQL) and "NoSQL" (MongoDB, CouchDB)
I'm only going to talk about PostgreSQL on the relational side since it's the only database I have experience with.
First, I would keep everything in one database. There's no benefit in using a database per account.
Second, you should be absolutely sure you WILL outgrow a single machine. Given the kind of application you'll be dealing with a lot more writes than reads, so a master-slave replication will only serve for high availability, and multi-master replication with PostgreSQL is NOT easy.
From my last research the least painful way to do that was to use a tool like Postgres-XC which is designed to be write-scalable, but I have no idea how production-ready it is.
Another solution is using tools like Bucardo or SkyTools. No experience with SkyTools but I had a lot of trouble getting Bucardo to work last year.
The last solution is to do sharding. The naive way to shard is to do something like
shard number = id % 10. However using this you would need to rebalance your cluster whenever you add/remove a shard.
It would require that you write your application "shard-aware" so that you direct the queries to the correct shard.
Anyway like I said before, make sure you will NEED to shard/clusterize first.
Now for the "NoSQL" side, I have no experience with any of the solutions, but I do know that MongoDB and CouchDB handle sharding themselves so it's way easier with those solutions, however you give up quite a lot.

I'll expand a bit on Vincent's answer.
As for sharding we have had good experience with PL/Proxy. And with sharding you can outgrow single machine without issues (read or write).
As for replication Londiste from Skytools is very easy to set up and use. And with it you get PgQ, quite nice messaging solution for Postgres.

ORM vs traditional database query, which are their fields?

ORM seems to be a fast-growing model, with both pros and cons in their side. From Ultra-Fast ASP.NET of Richard Kiessig (http://www.amazon.com/Ultra-Fast-ASP-NET-Build-Ultra-Scalable-Server/dp/1430223839/ref=pd_bxgy_b_text_b):
"I love them because they allow me to develop small, proof-of-concept sites extremely quickly. I can side step much of the SQL and related complexity that I would otherwise need and focus on the objects, business logic and presentation. However, at the same time, I also don't care for them because, unfortunately, their performance and scalability is usually very poor, even when they're integrated with a comprehensive caching system (the reason for that becomes clear when you realize that when properly configured, SQL Server itself is really just a big data cache"
My questions are:
What is your comment about Richard's idea. Do you agree with him or not? If not, please tell why.
What is the best suitable fields for ORM and traditional database query? in other words, where you should use ORM and where you should use traditional database query :), which kind/size... of applications you should undoubtedly choose ORM/traditional database query
Thanks in advance

I can't agree to the common complain about ORMs that they perform bad. I've seen many plain-SQL applications until now. While it is theoretically possible to write optimized SQL, in reality, they ruin all the performance gain by writing not optimized business logic.
When using plain SQL, the business logic gets highly coupled to the db model and database operations and optimizations are up to the business logic. Because there is no oo model, you can't pass around whole object structures. I've seen many applications which pass around primary keys and retrieve the data from the database on each layer again and again. I've seen applications which access the database in loops. And so on. The problem is: because the business logic is already hardly maintainable, there is no space for any more optimizations. Often when you try to reuse at least some of your code, you accept that it is not optimized for each case. The performance gets bad by design.
An ORM usually doesn't require the business logic to care too much about data access. Some optimizations are implemented in the ORM. There are caches and the ability for batches. This automatic (and runtime-dynamic) optimizations are not perfect, but they decouple the business logic from it. For instance, if a piece of data is conditionally used, it loads it using lazy loading on request (exactly once). You don't need anything to do to make this happen.
On the other hand, ORM's have a steep learning curve. I wouldn't use an ORM for trivial applications, unless the ORM is already in use by the same team.
Another disadvantage of the ORM is (actually not of the ORM itself but of the fact that you'll work with a relational database an and object model), that the team needs to be strong in both worlds, the relational as well as the oo.
Conclusion:
ORMs are powerful for business-logic centric applications with data structures that are complex enough that having an OO model will advantageous.
ORMs have usually a (somehow) steep learning curve. For small applications, it could get too expensive.
Applications based on simple data structures, having not much logic to manage it, are most probably easier and straight forward to be written in plain sql.
Teams with a high level of database knowledge and not much experience in oo technologies will most probably be more efficient by using plain sql. (Of course, depending on the applications they write it could be recommendable for the team to switch the focus)
Teams with a high level of oo knowledge and only basic database experience are most probably more efficient by using an ORM. (same here, depending on the applications they write it could be recommendable for the team to switch the focus)

ORM is pretty old, at least in the Java world.
Major problems with ORM:
Object-Oriented model and Relational model are quite different.
SQL is a high level language to access data based on relational algebra, different from any OO language like C#, Java or Visual Basic.Net. Mixing those can you the worst of two worlds, instead of the best
For more information search the web on things like 'Object-relational impedance mismatch'
Either case, a good ORM framework saves you on quite some boiler-plate code. But you still need to have knowlegde of SQL, how to setup a good SQL databasemodel. Start with creating a good databasemodel using SQL, then base your OO model on that (not the other way around)
However, the above only holds if you really need to use a SQL database. I recommend looking into NoSQL movement as well. There's stuff like Cassandra, Couch-db. While google'ing for .net solutions I found this stackoverflow question: https://stackoverflow.com/questions/1777103/what-nosql-solutions-are-out-there-for-net

I'm the author of the book with the text quoted in the question.
Let me emphatically add that I am not arguing against using business objects or object oriented programming.
One issue I have with conventional ORM -- for example, LINQ to SQL or Entity Framework -- is that it often leads to developers making DB calls when they don't even realize that they're doing so. This, in turn, is a performance and scalability killer.
I review lots of websites for performance issues, and have found that DB chattiness is one of the most common causes of serious problems. Unfortunately, ORM tends to encourage chattiness, in spades.
The other complaints I have about ORM include:
No support for command batching
No support for multiple result sets
No support for table valued parameters
No support for native async calls (making them from a background thread doesn't count)
Support for SqlDependency and SqlCacheDependency is klunky if/when it works at all
I have no objection to using ORM tactically, to address specific business issues. But I do object to using it haphazardly, to the point where developers do things like make the exact same DB call dozens of time on the same page, or issue hugely expensive queries without considering caching and change notifications, or totally neglect async operations when scalability is a concern.

This site uses Linq-to-SQL I believe, and it's 'fairly' high traffic... I think that the time you save from writing the boiler plate code to access/insert/update simple items is invaluable, but there is always the option to drop down to calling a SPROC if you have something more complex, where you know you can write some screaming fast SQL directly.
I don't think that these things have to be mutually exclusive - use the advantages of both, and if there are sections of your application that start to slow down, then you can optimise as you need to.

ORM is far older than both Java and .NET. The first one I knew about was TopLink for Smalltalk. It's an idea as old as persistent objects.
Every "CRUD on the web" framework like Ruby on Rails, Grails, Django, etc. uses ORM for persistence because they all presume that you are starting with a clean sheet object model: no legacy schema to bother with. You start with the objects to model your problem and generate the persistence from it.
It often works the other way with legacy systems: the schema is long-lived, and you may or may not have objects.
It's astonishing how quickly you can get a prototype up and running with "CRUD on the web" frameworks, but I don't see them being used to develop enterprise apps in large corporations. Maybe that's a Fortune 500 prejudice.
Database admins that I know tell me they don't like the SQL that ORMs generate because it's often inefficient. They all wish for a way to hand-tune it.

I agree with most points already made here.
ORM's are not new in .NET, LLBLGen has been around for a long time, I've been using them for >5 years now in .NET.
I've seen very bad performing code written without ORMs (in-efficient SQL queries, bad indexes, nested database calls - ouch!) and bad code written with ORMs - I'm sure I've contributed to some of the bad code too :)
What I would add is that an ORM is generally a powerful and productivity-enhancing tool that allows you to stop worrying about plumbing db code for most of your application and concentrate on the application itself. When you start trying to write complex code (for example reporting pages or complex UI's) you need to understand what is happening underneath the hood - ignorance can be very costly. But, used properly, they are immensely powerful, and IMO won't have a detrimental effect on your apps performance. I for one wouldn't be happy on a project that didn't use an ORM.

Programming is about writing software for business use. The more we can focus on business logic and presentation and less with technicalities that only matter at certain points in time (when software goes down, when software needs upgrading, etc), the better.
Recently I read about talks of scalability from a Reddit founder, from here, and one line of him that caught my attention was this:
"Having to deal with the complexities
of relational databases (relations,
joins, constraints) is a thing of the
past."
From what I have watched, maintaining a complex database schema, when it comes to scalability, becomes a major pain as the site grows (you add a field, you reassign constraints, re-map foreign keys...etc). It was not entirely clear to me as to why is that. They're not using a NOSQL database though, they're in Postgres.
Add to that, here comes ORM, another layer of abstraction. It simplifies code writing, but almost often at a performance penalty. For me, a simple database abstraction library will do, much like lightweight AR libs out there together with database-specific "plain text" queries. I can't show you any benchmark but with the ORMs I have seen, most of them say that "ORM can often be slow".
Richard covers both sides of the coin, so I agree with him.
As for the fields, I really don't quite get the context of the "fields" you are asking about.

As others have said, you can write underperforming ORM code, and you can also write underperforming SQL.
Using ORM doesn't excuse you from knowing your SQL, and understanding how a query fits together. If you can optimize a SQL query, you can usually optimize an ORM query. For example, hibernate's criteria and HQL queries let you control which associations are joined to improve performance and avoid additional select statements. Knowing how to create an index to improve your most common query can make or break your application performance.
What ORM buys you is uniform, maintainable database access. They provide an extra layer of verification to ensure that your OO code matches up as closely as possible with your database access, and prevent you from making certain classes of stupid mistake, like writing code that's vulnerable to SQL injection. Of course, you can parameterize your own queries, but ORM buys you that advantage without having to think about it.

Never got anything but pain and frustration from ORM packages. If I'd write my SQL the way they autogen it - yeah I'd claim to be fast while my code would be slow :-) Have you ever seen SQL generated by an ORM ? Barely has PK-s, uses FK-s only for misguided interpretation of "inheritance" and if it wants to do paging it dumps the whole recordset on you and then discards 90% of it :-))) Then it locks everything in sight since it has to take in a load of records like it went back to 50 yr old IBM's batch processing.
For a while I thought that the biggest problem with ORM was splintering (not going to have a standard in 50 yrs - every year different API, pardon "model" :-) and ideologizing (everyone selling you a big philosophy - always better than everyone else's of course :-) Then I realized that it was really the total amateurism that's the root cause of the mess and everything else is just the consequence.
Then it all started to make sense. ORM was never meant to be performant or reliable - that wasn't even on the list :-) It was academic, "conceptual" toy from the day one, the consolation prize for professors pissed off that all their "relational" research papers in Prolog went down the drain when IBM and Oracle started selling that terrible SQL thing and making a buck :-)
The closest I came to trusting one was LINQ but only because it's possible and quite easy to kick out all "tracking" and use is just as deserialization layer for normal SQL code. Then I read how the object that's managing connection can develop spontaneous failures that sounded like premature GC while it still had some dangling stuff around. No way I was going to risk my neck with it after that - nope, not my head :-)
So, let me make a list:
Totally sloppy code - not going to suffer bugs and poor perf
Not going to take deadlocks from ORM's 10-100 times longer "transactions"
Drastic reduction of capabilities - SQL has huge expressive power these days
Tying you up into fringe and sloppy API (every ORM aims to hijack your codebase)
SQL queries are highly portable and SQL knowledge is totally portable
I still have to know SQL just to clean up ORM's mess anyway
For "proof-of-concept" I can just serialize to binary or XML files
not much slower, zero bug libraries and one XPath can select better anyway
I've actually done heavy traffic web sites all from XML files
if I actually need real graph then I have no use for DB - nothing real to query
I can serialize a blob and dump into SQL in like 3 lines of code
If someone claims that he does it all from DB to UI - keep your codebase locked :-)
and backup your payroll DB - you'll thank me latter :-)))
NoSQL bases are more honest than ORM - "we specialize in persistence"
and have better code quality - not surprised at all
That would be the short list :-) BTW, modern SQL engines these days do trees and spatial indexing, not to mention paging without a single record wasted. ORM-s are actually "solving" problems of 10yrs ago and promoting amateurism. To that extent NoSQL, also known as document

What's the meaning of ORM?

I ever developed several projects based on python framework Django. And it greatly improved my production. But when the project was released and there are more and more visitors the db becomes the bottleneck of the performance.
I try to address the issue, and find that it's ORM(django) to make it become so slow. Why? Because Django have to serve a uniform interface for the programmer no matter what db backend you are using. So it definitely sacrifice some db's performance(make one raw sql to several sqls and never use the db-specific operation).
I'm wondering the ORM is definitely useful and it can:
Offer a uniform OO interface for the progarammers
Make the db backend migration much easier (from mysql to sql server or others)
Improve the robust of the code(using ORM means less code, and less code means less error)
But if I don't have the requirement of migration, What's the meaning of the ORM to me?
ps. Recently my friend told me that what he is doing now is just rewriting the ORM code to the raw sql to get a better performance. what a pity!
So what's the real meaning of ORM except what I mentioned above?
(Please correct me if I made a mistake. Thanks.)

You have mostly answered your own question when you listed the benefits of an ORM. There are definitely some optimisation issues that you will encounter but the abstraction of the database interface probably over-rides these downsides.
You mention that the ORM sometimes uses many sql statements where it could use only one. You may want to look at "eager loading", if this is supported by your ORM. This tells the ORM to fetch the data from related models at the same time as it fetches data from another model. This should result in more performant sql.
I would suggest that you stick with your ORM and optimise the parts that need it, but, explore any methods within the ORM that allow you to increase performance before reverting to writing SQL to do the access.

A good ORM allows you to tune the data access if you discover that certain queries are a bottleneck.
But the fact that you might need to do this does not in any way remove the value of the ORM approach, because it rapidly gets you to the point where you can discover where the bottlenecks are. It is rarely the case that every line of code needs the same amount of careful hand-optimisation. Most of it won't. Only a few hotspots require attention.
If you write all the SQL by hand, you are "micro optimising" across the whole product, including the parts that don't need it. So you're mostly wasting effort.

here is the definition from Wikipedia
Object-relational mapping is a programming technique for converting data between incompatible type systems in relational databases and object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language.

a good ORM (like Django's) makes it much faster to develop and evolve your application; it lets you assume you have available all related data without having to factor every use in your hand-tuned queries.
but a simple one (like Django's) doesn't relieve you from good old DB design. if you're seeing DB bottleneck with less than several hundred simultaneous users, you have serious problems. Either your DB isn't well tuned (typically you're missing some indexes), or it doesn't appropriately represents the data design (if you need many different queries for every page this is your problem).
So, i wouldn't ditch the ORM unless you're twitter or flickr. First do all the usual DB analysis: You see a lot of full-table scans? add appropriate indexes. Lots of queries per page? rethink your tables. Every user needs lots of statistics? precalculate them in a batch job and serve from there.

ORM separates you from having to write that pesky SQL.
It's also helpful for when you (never) port your software to another database engine.
On the downside: you lose performance, which you fix by writing a custom flavor of SQL - that it tried to insulate from having to write in the first place.

ORM generates sql queries for you and then return as object to you. that's why it slower than if you access to database directly. But i think it slow a little bit ... i recommend you to tune your database. may be you need to check about index of table etc.
Oracle for example, need to be tuned if you need to get faster ( i don't know why, but my db admin did that and it works faster with queries that involved with lots of data).
I have recommendation, if you need to do complex query (eg: reports) other than (Create Update Delete/CRUD) and if your application won't use another database, you should use direct sql (I think Django has it feature)

Django database scalability

We have a new django powered project which have a potential heavy-traffic characteristic(means a heavy db interaction). So we need to consider the database scalability in advance. With some researches, the following questions are still not clear to us:
coarse-grained: how to specify one db table(a django model) to a specific db(maybe in another server)?
fine-grained: how to specify a group of table rows to a specific db(so-called sharding, also can in another db server)?
how to specify write and read to different db?(which will be helpful for future mysql master/slave replication)
We are finding the solution with:
be transparent to application program(means we don't need to have additional codes in views.py)
should be in ORM level(means only needs to specify in models.py)
compatible with the current(or future) django release(to keep a minimal change for future's upgrading of django)
I'm still doing the research. And will share in this thread later if I've got some fruits.
Hope anyone with the experience can answer. Thanks.

Don't forget about caching either. Using memcached to relieve your DB of load is key to building a high performance site.
As alex said, django-core doesn't support your specific requests for those features, though they are definitely on the todo list.
If you don't do this in the application layer, you're basically asking for performance trouble. There aren't any really good open source automation layers for this sort of task, since it tends to break SQL axioms. If you're really concerned about it, you should be coding the entire application for it, not simply hoping that your ORM will take care of it.

There is the GSoC project by Alex Gaynor that in future will allow to use multiple databases in one Django project. But now there is no cross-RDBMS working solution.
There is no solution right now too.
And again - there is no cross-RDBMS solution. But if you are using MySQL you can try excellent third-party Django application called - mysql_replicated. It allows to setup master-slave replication scenario easily.

here for some reason we r using django with sqlalchemy. maybe combination of django and sqlalchemy also works for your needs.

How to best search against a DB with Lucene?

I am looking into mechanisms for better search capabilities against our database. It is currently a huge bottleneck (causing long-lasting queries that are hurting our database performance).
My boss wanted me to look into Solr, but on closer inspection, it seems we actually want some kind of DB integration mechanism with Lucene itself.
From the Lucene FAQ, they recommend Hibernate Search, Compass, and DBSight.
As a background of our current technology stack, we are using straight JSPs on Tomcat, no Hibernate, no other frameworks on top of it... just straight Java, JSP, and JDBC against a DB2 database.
Given that, it seems Hibernate Search might be a bit more difficult to integrate into our system, though it might be nice to have the option of using Hibernate after such an integration.
Does anyone have any experiences they can share with using one of these tools (or other similar Lucene based solutions) that might help in picking the right tool?
It needs to be a FOSS solution, and ideally will manage updating Lucene with changes from the database automagicly (though efficiently), without extra effort to notify the tool when changes have been made (otherwise, it seems rolling my own Lucene solution would be just as good). Also, we have multiple application servers with just 1 database (+failover), so it would be nice if it is easy to use the solution from all application servers seamlessly.
I am continuing to inspect the options now, but it would be really helpful to utilize other people's experiences.

When you say "search against a DB", what do you mean?
Relational databases and information retrieval systems use very different approaches for good reason. What kind of data are you searching? What kind of queries do you perform?
If I were going to implement an inverted index on top of a database, as Compass does, I would not use their approach, which is to implement Lucene's Directory abstraction with BLOBs. Rather, I'd implement Lucene's IndexReader abstraction.
Relational databases are quite capable of maintaining indexes. The value that Lucene brings in this context is its analysis capabilities, which are most useful for unstructured text records. A good approach would leverage the strengths of each tool.
As updates are made to the index, Lucene creates more segments (additional files or BLOBs), which degrade performance until a costly "optimize" procedure is used. Most databases will amortize this cost over each index update, giving you more stable performance.

I have had good experiences with Compass. It has really good integration with hibernate and can mirror data changes made through hibernate and jdbc directly to the Lucene indexes though its GPS devices http://www.compass-project.org/docs/1.2.2/reference/html/gps-jdbc.html.
Maintaining the Lucene indexes on all your application servers may be an issue. If you have multiple App servers updating the db, then you may hit some issues with keeping the index in sync with all the changes. Compass may have an alternate mechanism for handling this now.
The Alfresco Project (CMS) also uses Lucene and have a mechanism for replicating Lucene index changes between servers that may be useful in handling these issues.
We started using Compass before Hibernate Search was really off the ground so I cannot offer any comparison with it.

LuSql http://code.google.com/p/lusql/ allows you to load the contents of a JDBC-accessible database into Lucene, making it searchable. It is highly optimized and multi-threaded. I am the author of LuSql and will be coming out with a new version (re-architected with a new plugable architecture) in the next month.

For a pure performance boost with searching Lucene will certainly help out a lot. Only index what you care about/need and you should be good. You could use Hibernate or some other piece if you like but I don't think it is required.

Well, it seems DBSight doesn't meet the FOSS requirement, so unless it is an absolutely stellar solution, it is not an option for me right now...

Categories

azure-form-recognizer

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight