Fooling Around With Databases, Online Development

Fooling Around With Databases, Online Development - database

I'm trying to design a database for a small project I'm working on. Eventually, I'd like to make it a web-app, but right now I don't mind just experimenting with data offline. However, I'm stuck in a crossroads.
The basic concept would be a user inputs values for 10 fields, to be compared against what is in the database, with each item having a weighted value. I know that if I was to code it, that I could use a look-up tables for each field, add up the values, and display the result to the end-user.
Another example would be having to get the distance between two points, each point stored in a row, with the X value getting its own column as well as the Y value.
Now, if I store data within a database, should I try to do everything within queries (which I think would involve temporary tables among other things), or just use simple queries, and manipulate the rows returned within the application code?
Right now, I'm thinking to go for the latter (manipulate data within the app) and just use queries to reduce the amount of data that I would have to go sort through. What would you guys suggest?
EDIT: Right now I'm using Microsoft Access to get the basics down pat and try to get a good design going. IIRC with my experience with Oracle and MySQL you can run commands together in a batch process and return just one result. But not sure if you can do that with Access.

If you're using a database I would strongly suggest using SQL to do all your manipulation. SQL is far more capable and powerful for this kind of job as compared to imperative programming languages.
Of course it does imply that you're comfortable in thinking about data as "sets" and programming in a declarative style. But spending time now to get really comfortable with SQL and manipulating data using SQL will pay off big time in the long run. Not only for this project but for projects in the future. I would also suggest using stored procedures over queries in code because stored procedure provide a beautiful abstraction layer allowing your table design to change over time without impacting the rest of the system.
A very big part of using and working with databases is understanding Data modeling, normalization and the like. Like everything else it will be a effort but in the long run it will pay off.
May I ask why you're using Access when you have a far better database available to you such as MSSQL Express? The migration path from MSSQL Express to MSSQL or SQL Azure even is quite seamless and everything you do and experience today (in this project) completely translates to MSSQL Server/SQL Azure for future projects as well as if this project grows beyond your expectations.
I don't understand your last statement about running a batch process and getting just one result, but if you can do it in Oracle and MySQL then you can do it in MSSQL Express as well.

What Shiv said, and also...
A good DBMS has quite a bit of solid engineering in it. There are two components that are especially carefully engineered, namely the query optimizer and the transaction controller. If you adopt the view of using the DBMS as just a stupid table retrieval tool, you will most likely end up inventing your own optimizer and transaction controller inside the application. You won't need the transaction controller until you move to an environment that supports multiple concurrent users.
Unless your engineering talents are extraordinary, you will probably end up with a home brew data management system that is not as good as the one in a good DBMS.
The learning curve for SQL is steep. You need to learn how to phrase queries that join, project, and restrict data from multiple tables. You need to learn how to handle updates in the context of a transaction.
You need to learn simple and sound table design and index design. This includes, but is not limited to, data normalization and data modeling. And you need a DBMS with a good optimizer and good transaction control.
The learning curve is steep. But the view from the top is worth the climb.

Related

Techniques for understanding complicated legacy database

I have inherited a complex, poorly documented SQL Server database and am trying to reverse engineer the tables, their relationships and in general how the data fits together. To do this I am drawing the table diagrams using Visio, attempting to determine the data relationships using the existing queries (there are no relationships actually defined) and using the application logic (what little there is) to help.
This is however quite challenging given the lack of documentation / obfuscated nature of this system and I was wondering if there are any techniques / approaches that can help?

Not an easy job, and the bigger it is the harder it is. I do however have one small trick for discovering the necessary surface area of the database - remove all user access!
This sort of database often runs with full access (i.e. users are DBO level or above) so anyone can do anything. If you remove all user access (but not your own!), then start application testing and enable minimum permissions on a per object basis, you will eventually reveal the surface area of the database (the exposed objects). This information is very useful in helping to identify dead objects - your job is hard enough as it is without having to document SPs that are no longer used for example.
How successful this technique can be depends on how thorough you think you can be in your testing, how sure you can be that someone somewhere doesn't have an Excel workbook that connects to the database once a year to run some query that no-one else ever does.

Does it make sense to use an OR-Mapper?

Does it make sense to use an OR-mapper?
I am putting this question of there on stack overflow because this is the best place I know of to find smart developers willing to give their assistance and opinions.
My reasoning is as follows:
1.) Where does the SQL belong?
a.) In every professional project I have worked on, security of the data has been a key requirement. Stored Procedures provide a natural gateway for controlling access and auditing.
b.) Issues with Applications in production can often be resolved between the tables and stored procedures without putting out new builds.
2.) How do I control the SQL that is generated? I am trusting parse trees to generate efficient SQL.
I have quite a bit of experience optimizing SQL in SQL-Server and Oracle, but would not feel cheated if I never had to do it again. :)
3.) What is the point of using an OR-Mapper if I am getting my data from stored procedures?
I have used the repository pattern with a homegrown generic data access layer.
If a collection needed to be cached, I cache it. I also have experience using EF on a small CRUD application and experience helping tuning an NHibernate application that was experiencing performance issues. So I am a little biased, but willing to learn.
For the past several years we have all been hearing a lot of respectable developers advocating the use of specific OR-Mappers (Entity-Framework, NHibernate, etc...).
Can anyone tell me why someone should move to an ORM for mainstream development on a major project?
edit: http://www.codinghorror.com/blog/2006/06/object-relational-mapping-is-the-vietnam-of-computer-science.html seems to have a strong discussion on this topic but it is out of date.
Yet another edit:
Everyone seems to agree that Stored Procedures are to be used for heavy-duty enterprise applications, due to their performance advantage and their ability to add programming logic nearer to the data.
I am seeing that the strongest argument in favor of OR mappers is developer productivity.
I suspect a large motivator for the ORM movement is developer preference towards remaining persistence-agnostic (don’t care if the data is in memory [unless caching] or on the database).
ORMs seem to be outstanding time-savers for local and small web applications.
Maybe the best advice I am seeing is from client09: to use an ORM setup, but use Stored Procedures for the database intensive stuff (AKA when the ORM appears to be insufficient).

I was a pro SP for many, many years and thought it was the ONLY right way to do DB development, but the last 3-4 projects I have done I completed in EF4.0 w/out SP's and the improvements in my productivity have been truly awe-inspiring - I can do things in a few lines of code now that would have taken me a day before.
I still think SP's are important for some things, (there are times when you can significantly improve performance with a well chosen SP), but for the general CRUD operations, I can't imagine ever going back.
So the short answer for me is, developer productivity is the reason to use the ORM - once you get over the learning curve anyway.

A different approach... With the raise of No SQL movement now, you might want to try object / document database instead to store your data. In this way, you basically will avoid the hell that is OR Mapping. Store the data as your application use them and do transformation behind the scene in a worker process to move it into a more relational / OLAP format for further analysis and reporting.

Stored procedures are great for encapsulating database logic in one place. I've worked on a project that used only Oracle stored procedures, and am currently on one that uses Hibernate. We found that it is very easy to develop redundant procedures, as our Java developers weren't versed in PL/SQL package dependencies.
As the DBA for the project I find that the Java developers prefer to keep everything in the Java code. You run into the occassional, "Why don't I just loop through all the Objects that just returned?" This caused a number of "Why isn't the index taking care of this?" issues.
With Hibernate your entities can contain not only their linked database properties, but can also contain any actions taken upon them.
For example, we have a Task Entity. One could Add or Modify a Task among other things. This can be modeled in the Hibernate Entity in Named Queries.
So I would say go with an ORM setup, but use procedures for the database intensive stuff.
A downside of keeping your SQL in Java is that you run the risk of developers using non-parameterized queries leaving your app open to a SQL Injection.

The following is just my private opinion, so it's rather subjective.
1.) I think that one needs to differentiate between local applications and enterprise applications. For local and some web applications, direct access to the DB is okay. For enterprise applications, I feel that the better encapsulation and rights management makes stored procedures the better choice in the end.
2.) This is one of the big issues with ORMs. They are usually optimized for specific query patterns, and as long as you use those the generated SQL is typically of good quality. However, for complex operations which need to be performed close to the data to remain efficient, my feeling is that using manual SQL code is stilol the way to go, and in this case the code goes into SPs.
3.) Dealing with objects as data entities is also beneficial compared to direct access to "loose" datasets (even if those are typed). Deserializing a result set into an object graph is very useful, no matter whether the result set was returned by a SP or from a dynamic SQL query.
If you're using SQL Server, I invite you to have a look at my open-source bsn ModuleStore project, it's a framework for DB schema versioning and using SPs via some lightweight ORM concept (serialization and deserialization of objects when calling SPs).

What should I have in mind when building OLAP solution from scratch?

I'm working for a company running a software product based on a MS SQL database server, and through the years I have developed 20-30 quite advanced reports in PHP, taking data directly from the database. This has been very successful, and people are happy with it.
But it has some drawbacks:
For new changes, it can be quite development intensive
The user can't experiment much with the data - it is locked to a hard-coded view
It can be slow for big reports
I am considering gradually going to a OLAP-based approach, which can be queried from Excel or some web-based service. But I would like to do this in a way that introduces the least amount of new complexity in the IT environment - the least amount of different services, synchronization jobs etc!
I have some questions in this regard:
1) Workflow-related:
What is a good development route from "black box SQL server" to "OLAP ready to use"?
Which servers and services should be set up, and which scripts should be written?
Which are the hardest/most critical/most time-intensive parts?
2) ETL:
I suppose it is best to have separate servers for their Data Warehouse and Production SQL?
How are these kept in sync (push/pull)? Using which technologies/languages?
For me SSIS looks overly complicated, and the graphical workflow doesn't appeal much to me -- I would rather like a text based script that does the job. Is this feasible?
Or is it advantagous to use the graphical client with only one source and one destination?
3) Development:
How much of this (data integration, analysis services) can be efficiently maintained from a CLI-tool?
Can the setup be transferred back and forth between production and development easily?
I'm happy with any answer that covers just some of this - and even though it is a MS environment, I'm also interested to hear about advantages in other technologies.

I only have experience with Microsoft OLAP, so here are my two cents regarding what I know:
If you are implementing cubes, then separate the production SQL Server from the source for the cubes. Cubes require a lot of SELECT DISTINCT column_name FROM source.table. You don't want cube processing to block your mission critical production system.
Although you can implement OLAP cubes with standard relation tables, you will quickly find that unless your data is a ledger-style system you will probably need to fully reprocess your fact and dimension tables and this will require requerying the source database over and over again. That's a large argument for building a separate data warehouse that uses ledger-style transactions for the fact tables. For instance, if a customer orders something and then cancels it, your source system may track this as a status change. In your fact table, you probably need to show this as a row for ordering that has a positive quantity and revenue stream and a row for cancelling that has a negative quantity and revenue stream.
OLAP may be overkill for your environment. The main issue you appeared to raise was that your reports are static and users want access to the data directly. You could build a data model and give users Report Builder access in SSRS, or report writing access in some other BI suite like Cognos, Business Objects, etc. I don't generally recommend this approach since it is way beyond what most users should have to know to get data, but in a small shop this may be sufficient and it is easy to implement. Let's face it -- users generally just want to get the data into Excel to manipulate it further. So if you don't want to give them a web front-end and you just want them to get to the data from Excel, you could give them direct database access to a copy of the production data. The downside of this approach is users don't generally understand SQL or database relationships. OLAP helps you avoid forcing users to learn SQL or relationships, but is isn't easy to implement on your end. If you only have a couple of power users who need this kind of access, it could be easy enough to teach the few power users how to do basic queries in Excel against the database and they will be happy to get this tomorrow. OLAP won't be ready by tomorrow.
If you only have a few kinds of source data systems, you could get away with building a super-dynamic static report. For instance, I have a report that was written in C# that basically allows users to select as many columns as they want from a list of 30 columns and filter the data on a few date range fields and field filter lists. This simple report covers about 40% of all ad hoc report requests from end-users since it covers all the basic, core customer metrics and fields. We recently moved this report to SSRS and that allowed us to up the number of fields to about 100 and improved the overall user experience. Regardless of the reporting platform, it is possible to give users some dynamic flexibility even in the confines of a static reporting system.
If you only have a couple of databases, you can probably backup and restore the databases as your ETL. However, if you want to do anything beyond that, then you might as well bite the bullet and use SSIS (or some other ETL tool). Once you get into ETL for data warehousing, you are going to use a graphic-oriented design tool. Coding works well for applications, but ETL is more about workflows and that's why the tools tend to converge on a graphical UI. You can work around this and try to code a data warehouse from a text editor, but in the end you are going to lose out on a lot. See this post for more details on the differences between loading data from code and loading data from SSIS.
FEEDBACK ON HOW TO USE CUBES WITH A RELATIONAL DATA STORE
It is possible to implement a cube over a relational data store, but there are some major problems with using this approach. The main reason it is technically feasible has to do with how you configure your DSV. The DSV is essentially a logical layer between the physical database and the cube/dimension definitions. Instead of importing the relational tables into the DSV, you could define Named Queries or create views in the database that flatten the data.
The advantage of this approach are as follows:
It is relatively easy to implement since you don't have to build an entire ETL subsystem to get started with OLAP.
This approach works well for prototyping how you want to build a more long-term solution. You can prototype it in 1-2 days and show some of the benefits of OLAP today.
Some very, very large tables don't have to be completely duplicated just to support an OLAP cube. I have several multi-billion row tables that are almost completely standardized fact tables. The only columns they don't have are date keys and they also contain some NULL values on fields that shouldn't have nulls at all. Instead of duplicating these very massive tables, you can create the surrogate date keys and set values for the nulls in the view or named query. If you aren't going to see a huge performance boon for duplicating the table, then this may be a candidate for leaving in a more raw format in the database itself.
The disadvantages of this approach are as follows:
If you haven't built a true Kimball method data warehouse, then you probably aren't tracking transactions in a ledger-style. Kimball method fact tables (at least as I understand them) always change values by adding and subtracting rows. If someone cancels part of an order, you can't update the value in the cube for the single transaction. Instead, you have to balance out the transaction with a negative value. If you have to update the transaction, then you will have to fully reprocess the partition of the cube to replace the value which can be a very expensive operation. Unless your source system is a ledger-style transaction system, you will probably have to build a ledger-style copy in your ETL subsystem.
If you don't build a Kimball method data warehouse, then you are probably using unobscured and possibly non-integer primary keys in your database. This directly impacts query performance inside the cube. It also sets you up for having a theoretically inflexible data warehouse. For instance, if you have an product ordering system that uses an integer key and you start using a second product ordering system either as a replacement for the legacy system or in tandem with the legacy system, you may struggle to combine the data together merely through the DSV since each system has different data points, metrics, workflows, data types, etc. Worse, if they have the same data types for the order id and the order id values overlap between systems, then you must declare a surrogate key that you can use across both systems. This can be difficult, but not impossible, to implement without using a flattened data warehouse.
You may have to build the system twice if you start with the relational data store and then move to flattened database. Frankly, I think the amount of duplicated work is trivial. Most of what you learned building the cube off a relational data store will translate to setting up the new OLAP cube. The main problem, though, is that you will probably create a new cube altogether and then any users of the old cube will have to migrate to the new cube. Any reports built in SSRS or Excel will probably break at that point and need to be rewritten from the ground up. So the main cost of rebuilding the cube is really on rebuilding dependent objects -- not on rebuilding the cube itself.
Let me know if you want me to expand on any of the above points. good luck.

You're basically asking the million dollar question of "How do I build a DWH". This is not really a question that can decisively be answered.
Nevertheless, here is a kickstart:
If you are looking for a minimum viable product, be aware that you are in a data environment, and not a pure software one. In data-heavy environments, it is much harder to incrementally build a product, because the amount of effort to introduce changes in the system is much greater. Think about it as if every change you make in a piece of software has to be somehow backwards-compatible with anything you've ever done. Now you understand the hell Microsoft are in :-).
Also, data systems involve many third-party tools such as DBs, ETL tools and reporting platforms. The choices you make should be viable for the expected development of your system, else you might have to completely replace these tools down the road.
While you can start with a DB cloning that will be based on simple copy SQLs and then aggregating it or pushing it into an OLAP, I would recommend getting your hands dirty with a real ETL tool from the start. This is especially true if you foresee the need to grow. 9 out of 10 times, the need will grow.
MS-SQL is a good choice for a DB if you don't mind the cost. The natural ETL tool would be SSIS, and it's a solid tool as well.
Even if your first transformations are merely "take this table and dump it in there", you still gain a lot in terms of process management (has the job run? What happens if it fails? etc) and debugging. Also, it is easier to organically grow as requirements and/or special cases have to be dealt with.

What's the meaning of ORM?

I ever developed several projects based on python framework Django. And it greatly improved my production. But when the project was released and there are more and more visitors the db becomes the bottleneck of the performance.
I try to address the issue, and find that it's ORM(django) to make it become so slow. Why? Because Django have to serve a uniform interface for the programmer no matter what db backend you are using. So it definitely sacrifice some db's performance(make one raw sql to several sqls and never use the db-specific operation).
I'm wondering the ORM is definitely useful and it can:
Offer a uniform OO interface for the progarammers
Make the db backend migration much easier (from mysql to sql server or others)
Improve the robust of the code(using ORM means less code, and less code means less error)
But if I don't have the requirement of migration, What's the meaning of the ORM to me?
ps. Recently my friend told me that what he is doing now is just rewriting the ORM code to the raw sql to get a better performance. what a pity!
So what's the real meaning of ORM except what I mentioned above?
(Please correct me if I made a mistake. Thanks.)

You have mostly answered your own question when you listed the benefits of an ORM. There are definitely some optimisation issues that you will encounter but the abstraction of the database interface probably over-rides these downsides.
You mention that the ORM sometimes uses many sql statements where it could use only one. You may want to look at "eager loading", if this is supported by your ORM. This tells the ORM to fetch the data from related models at the same time as it fetches data from another model. This should result in more performant sql.
I would suggest that you stick with your ORM and optimise the parts that need it, but, explore any methods within the ORM that allow you to increase performance before reverting to writing SQL to do the access.

A good ORM allows you to tune the data access if you discover that certain queries are a bottleneck.
But the fact that you might need to do this does not in any way remove the value of the ORM approach, because it rapidly gets you to the point where you can discover where the bottlenecks are. It is rarely the case that every line of code needs the same amount of careful hand-optimisation. Most of it won't. Only a few hotspots require attention.
If you write all the SQL by hand, you are "micro optimising" across the whole product, including the parts that don't need it. So you're mostly wasting effort.

here is the definition from Wikipedia
Object-relational mapping is a programming technique for converting data between incompatible type systems in relational databases and object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language.

a good ORM (like Django's) makes it much faster to develop and evolve your application; it lets you assume you have available all related data without having to factor every use in your hand-tuned queries.
but a simple one (like Django's) doesn't relieve you from good old DB design. if you're seeing DB bottleneck with less than several hundred simultaneous users, you have serious problems. Either your DB isn't well tuned (typically you're missing some indexes), or it doesn't appropriately represents the data design (if you need many different queries for every page this is your problem).
So, i wouldn't ditch the ORM unless you're twitter or flickr. First do all the usual DB analysis: You see a lot of full-table scans? add appropriate indexes. Lots of queries per page? rethink your tables. Every user needs lots of statistics? precalculate them in a batch job and serve from there.

ORM separates you from having to write that pesky SQL.
It's also helpful for when you (never) port your software to another database engine.
On the downside: you lose performance, which you fix by writing a custom flavor of SQL - that it tried to insulate from having to write in the first place.

ORM generates sql queries for you and then return as object to you. that's why it slower than if you access to database directly. But i think it slow a little bit ... i recommend you to tune your database. may be you need to check about index of table etc.
Oracle for example, need to be tuned if you need to get faster ( i don't know why, but my db admin did that and it works faster with queries that involved with lots of data).
I have recommendation, if you need to do complex query (eg: reports) other than (Create Update Delete/CRUD) and if your application won't use another database, you should use direct sql (I think Django has it feature)

Need help designing big database update process

We have a database with ~100K business objects in it. Each object has about 40 properties which are stored amongst 15 tables. I have to get these objects, perform some transforms on them and then write them to a different database (with the same schema.)
This is ADO.Net 3.5, SQL Server 2005.
We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.
My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficent.
What are some good designs for handling this type of problem?
Thanks

This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.

Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.

How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to do a week of work setting up a SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.
You might want to research Create_Assembly, moving your client code across the network to reside on your SQL box. This will avoid network latency, but could destabilize your SQL Server.

Bad news: you have many options
use flatfile transformations: Extract all the data into flatfiles, manipulate them using grep, awk, sed, c, perl into the required insert/update statements and execute those against the target database
PRO: Fast; CON: extremly ugly ... nightmare for maintanance, don't do this if you need this for longer then a week. And a couple dozens of executions
use pure sql: I don't know much about sql server, but I assume it has away to access one database from within the other, so one of the fastes ways to do this is to write it as a collection of 'insert / update / merge statements fed with select statements.
PRO: Fast, one technology only; CON: Requires direct connection between databases You might reach the limit of SQL or the available SQL knowledge pretty fast, depending on the kind of transformation.
use t-sql, or whatever iterative language the database provides, everything else is similiar to pure sql aproach.
PRO: pretty fast since you don't leave the database CON: I don't know t-sql, but if it is anything like PL/SQL it is not the nicest language to do complex transformation.
use a high level language (Java, C#, VB ...): You would load your data into proper business objects manipulate those and store them in the database. Pretty much what you seem to be doing right now, although it sounds there are better ORMs available, e.g. nhibernate
use a ETL Tool: There are special tools for extracting, transforming and loading data. They often support various databases. And have many strategies readily available for deciding if an update or insert is in place.
PRO: Sorry, you'll have to ask somebody else for that, I so far have nothing but bad experience with those tools.
CON: A highly specialized tool, that you need to master. I my personal experience: slower in implementation and execution of the transformation then handwritten SQL. A nightmare for maintainability, since everything is hidden away in proprietary repositories, so for IDE, Version Control, CI, Testing you are stuck with whatever the tool provider gives you, if any.
PRO: Even complex manipulations can be implemented in a clean maintainable way, you can use all the fancy tools like good IDEs, Testing Frameworks, CI Systems to support you while developing the transformation.
CON: It adds a lot of overhead (retrieving the data, out of the database, instanciating the objects, and marshalling the objects back into the target database. I'd go this way if it is a process that is going to be around for a long time.
Building on the last option you could further glorify the architectur by using messaging and webservices, which could be relevant if you have more then one source database, or more then one target database. Or you could manually implement a multithreaded transformer, in order to gain through put. But I guess I am leaving the scope of your question.

I'm with John, SSIS is the way to go for any repeatable process to import large amounts of data. It should be much faster than the 30 hours you are currently getting. You could also write pure t-sql code to do this if the two database are on the same server or are linked servers. If you go the t-sql route, you may need to do a hybrid of set-based and looping code to run on batches (of say 2000 records at a time) rather than lock up the table for the whole time a large insert would take.