Database choice for crawled page semantics


I'm not sure whether this question has already been asked in the past.
I'm writing a web crawler intended to extract promotions, prices, and product descriptions from multiple websites.
Which database would be ideal for doing an in-memory comparison of promotions and prices, based on identifying the same product across multiple websites?
I know the design of the Scraper, HTMLDataProcessor, and Storage components for wrangling is going to be complex, but for now I'm looking for a solution to the data layer choice.
Appreciate the help on this.

I'd suggest you first create your object model or entity-relationship (ER) diagram for all the entities.
For instance you can see the tutorial here: http://creately.com/blog/diagrams/er-diagrams-tutorial/
Once you have the diagram and the relationships between your entities, you can decide whether you need a relational database or not.
You need to answer questions like:
Do you care about FK (foreign key) constraints?
What is the most common query, and do you care about its performance?
Is an in-memory database sufficient or do you need data to be persisted?
Think along those lines.
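To make those questions concrete, here is a minimal sketch of the in-memory option, using Python's built-in `sqlite3` module. The table names, the `gtin` column (a shared product identifier assumed to be extractable from every site), and the sample rows are all hypothetical.

```python
import sqlite3

# ":memory:" keeps everything in RAM; the data is gone when the connection closes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        gtin       TEXT NOT NULL UNIQUE      -- assumed cross-site identifier
    );
    CREATE TABLE offer (
        offer_id    INTEGER PRIMARY KEY,
        product_id  INTEGER NOT NULL REFERENCES product(product_id),
        site        TEXT NOT NULL,
        price_cents INTEGER NOT NULL,
        UNIQUE (product_id, site)            -- one current offer per site
    );
""")
conn.execute("INSERT INTO product (product_id, gtin) VALUES (1, '0001234567890')")
conn.executemany(
    "INSERT INTO offer (product_id, site, price_cents) VALUES (?, ?, ?)",
    [(1, "shop-a.example", 1999), (1, "shop-b.example", 1749)],
)


def cheapest_offer(conn, gtin):
    """Return (site, price_cents) of the lowest-priced offer, or None."""
    return conn.execute(
        """
        SELECT o.site, o.price_cents
        FROM offer AS o
        JOIN product AS p ON p.product_id = o.product_id
        WHERE p.gtin = ?
        ORDER BY o.price_cents ASC
        LIMIT 1
        """,
        (gtin,),
    ).fetchone()


print(cheapest_offer(conn, "0001234567890"))
```

If the data later needs to be persisted, the same schema works with a file path instead of `":memory:"`, which is one reason to settle the entity model before the storage engine.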

Related

How to design a database that can handle unknown reports?

I am working on a project which stores large amounts of data on multiple industries.
I have been tasked with designing the database schema.
I need to make the database schema flexible so it can handle complex reporting on the data.
For example,
what products are trending in industry x
what other companies have a similar product to my company
how is my company website different to x company website
There could be all sorts of reports. Right now everything is vague. But I know for sure the reports need to be fast.
Am I right in thinking my best path is to try to make as many association tables as I can? The idea being (for example) if the product table is linked to the industry table, it'll be relatively easy to get all products for a certain industry without having to go through joins on other tables to try to make a connection to the data.
This seems insane though. The schema will be so large and complex.
Please tell me if what I'm doing is correct or if there is some other known solution for this problem. Perhaps the solution is to hire a data scientist or DBA whose job is to do this sort of thing, rather than getting the programmer to do it.
Thank you.

I think getting these kinds of answers from a relational/operational database will be very difficult and the queries will be really slow.
The best approach I think will be to create multidimensional data structures (in other words a data warehouse) where you will have flattened data which will be easier to query than a relational database. It will also have historical data for trend analysis.
If there is a need for complex statistical or predictive analysis, then the data scientists can use the data warehouse as their source.

Adding to Amit's answer above, the problem is that what you need from your transactional database is a heavily normalized association of facts for operational purposes. For an analytic side you want what are effectively tagged facts.
In other words what you want is a series of star schemas where you can add whatever associations you want.
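As a rough illustration of what one such star schema might look like, here is a sketch using Python's built-in `sqlite3` module. The dimension tables, the `units_sold` measure, and the sample data are assumptions made up for the example, not part of the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_industry (industry_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, month TEXT NOT NULL);

    -- The fact table holds the measures, tagged with one key per dimension.
    CREATE TABLE fact_sales (
        industry_id INTEGER NOT NULL REFERENCES dim_industry(industry_id),
        product_id  INTEGER NOT NULL REFERENCES dim_product(product_id),
        date_id     INTEGER NOT NULL REFERENCES dim_date(date_id),
        units_sold  INTEGER NOT NULL
    );

    INSERT INTO dim_industry VALUES (1, 'retail');
    INSERT INTO dim_product  VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO dim_date     VALUES (1, '2024-01'), (2, '2024-02');
    INSERT INTO fact_sales   VALUES (1, 1, 1, 10), (1, 1, 2, 30),
                                    (1, 2, 1, 5),  (1, 2, 2, 4);
""")


def units_by_product_and_month(conn, industry):
    """Monthly units per product in one industry: raw material for a trend report."""
    return conn.execute(
        """
        SELECT p.name, d.month, SUM(f.units_sold)
        FROM fact_sales AS f
        JOIN dim_industry AS i ON i.industry_id = f.industry_id
        JOIN dim_product  AS p ON p.product_id  = f.product_id
        JOIN dim_date     AS d ON d.date_id     = f.date_id
        WHERE i.name = ?
        GROUP BY p.name, d.month
        ORDER BY p.name, d.month
        """,
        (industry,),
    ).fetchall()
```

Every report joins the fact table directly to the dimensions it needs, so adding a new association means adding a dimension rather than threading joins through operational tables.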

Good place to look for example Database Designs - Best practices [closed]

I have been given the task to design a database to store a lot of information for our company. Because the task is rather big and contains multiple modules where users should be able to do stuff, I'm worried about designing a good data model for this. I just don't want to end up with a badly designed database.
I want to have some decent examples of database structures for contracts / billing / orders etc to combine those in one nice relational database. Are there any resources out there that can help me with some examples regarding this?

Barry Williams has published a library of about six hundred data models for all sorts of applications. Almost certainly it will give you a "starter for ten" for all your subsystems. Access to this library is free so check it out.
It sounds like this is a big "enterprise-y" application your organisation wants, and you seem to be a bit of a beginner with databases. If at all possible you should start with a single sub-system - say, Orders - and get that working: not just the database tables, but also a skeleton front-end for it. Once that is good enough, add another, related sub-system such as Billing. You don't want to end up with a sprawling monster.
Also make sure you have a decent data modelling tool. SQL Power Architect is nice enough for a free tool.

Before you start read up on normalization until you have no questions about it at all. If you only did this in school, you probably don't know enough about it to design yet.
Gather your requirements for each module carefully. You need to know:
Business rules (which are specific to applications and which must be enforced in the database because they must be enforced on all records no matter the source),
Are there legal or regulatory concerns (HIPAA for instance or Sarbanes-Oxley requirements)
security (does data need to be encrypted?)
What data do you need to store and why (is this data available anywhere else)
Which pieces of data will only have one row of data and which will need to have multiple rows?
How do you intend to enforce uniqueness of the row in each table? Do you have a natural key or do you need a surrogate key (suggest a surrogate key in almost all cases)?
Do you need replication?
Do you need auditing?
How is the data going to be entered into the database? Will it come from the application one record at a time (or even from multiple applications), or will some of it come from bulk inserts from an ETL tool or from another database?
Do you need to know who entered the record and when (highly likely this will be necessary in an enterprise system)?
What kind of lookup tables will you need? Data entry is much more accurate when you can use lookup tables and restrict the users to the values.
What kind of data validation do you need?
Roughly how many records will the system have? You need to have an idea to know how big to create your test data.
How are you going to query the data? Will you be using stored procs or an ORM or dynamic queries?
Some very basic things to remember in your design. Choose the right data type for your data. Do not store dates or numbers you intend to do math on in string fields. Do store numbers that are not candidates for math (part numbers, zip codes, phone numbers, etc.) as string data, as you may need leading zeros. Do not store more than one piece of information in a field. So no comma-concatenated lists (these indicate the need for a related table), and while you are at it, if you find yourself doing something like phone1, phone2, phone3, stop right away and design a related table. Do use foreign keys for data integrity purposes.
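A small sketch of those basics, using Python's built-in `sqlite3` module: a zip code stored as text so the leading zero survives, a related table in place of phone1/phone2/phone3 columns, and a foreign key that the database enforces. The `customer` and `customer_phone` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        zip_code    TEXT NOT NULL          -- text, so '02134' keeps its leading zero
    );
    -- One row per phone number instead of phone1, phone2, phone3 columns.
    CREATE TABLE customer_phone (
        phone_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        phone       TEXT NOT NULL
    );
""")
conn.execute(
    "INSERT INTO customer (customer_id, name, zip_code) VALUES (1, 'Ada', '02134')"
)
conn.executemany(
    "INSERT INTO customer_phone (customer_id, phone) VALUES (?, ?)",
    [(1, "555-0100"), (1, "555-0101")],
)


def phones_for(conn, customer_id):
    """Return all phone numbers for one customer, in insertion order."""
    rows = conn.execute(
        "SELECT phone FROM customer_phone WHERE customer_id = ? ORDER BY phone_id",
        (customer_id,),
    )
    return [row[0] for row in rows]
```

Adding a third or tenth phone number is just another row, and a phone row pointing at a customer that does not exist is rejected by the database rather than by application code.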
All the way through your design, consider data integrity. Data that has no integrity is meaningless and useless. Do design for performance; this is critical in database design and is NOT premature optimization. Databases do not refactor easily, so it is important to get the most critical parts of the performance equation right the first time. In fact, all databases need to be designed for data integrity, performance, and security.
Do not be afraid to have multiple joins, properly indexed these will perform just fine. Do not try to put everything into an entity value type table. Use these as sparingly as possible. Try to learn to think in terms of handling sets of data, it will help your design. Databases are optimized to do things in sets.
There's more but this is enough to start digesting.

Try to keep your concerns separate here. Users being able to update the database is more of an "application design" problem. If you get your database design right then it should be a case of developing a nice front end for it.
First thing to look at is Normalization. This is the process of eliminating any redundant data from your tables. This will help keep your database neat, and only store information that is relevant to your needs.

The Data Model Resource Book.
http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471380237/ref=dp_cp_ob_b_title_0
HEAVY stuff, but very well thought out. 3 volumes all in all...
Has a lot of very well thought out generic structures - but they are NOT easy, as they cover everything ;) Always a good starting point, though.

The database should not be the model. It is used to save information between work sessions.
You should not build your application upon a data model, but upon a good object-oriented model that follows the business logic.
Once your object model is done, then think about how you can save and load it, with all the database design that goes with it.
(But apparently your company just wants you to design a database, not an application?)

How would you design your database to allow user-defined schema

If you have to create an application like, let's say, a blog application, creating the database schema is relatively simple. You have to create some tables (tblPosts, tblAttachments, tblComments, tblBlaBla…) and that's it (OK, I know, that's a bit simplified, but you understand what I mean).
What if you have an application where you want to allow users to define parts of the schema at runtime. Let's say you want to build an application where users can log any kind of data. One user wants to log his working hours (startTime, endTime, project Id, description), the next wants to collect cooking recipes, others maybe stock quotes, the weekly weight of their babies, monthly expenses they spent for food, the results of their favorite football teams or whatever stuff you can think about.
How would you design a database to hold all that very very different kind of data? Would you create a generic schema that can hold all kind of data, would you create new tables reflecting the user data schema or do you have another great idea to do that?
If it's important: I have to use SQL Server / Entity Framework

Let's try again.
If you want them to be able to create their own schema, then why not build the schema using, oh, I dunno, the CREATE TABLE statement? You have a full-boat, fully functional, powerful database that can do amazing things like define schemas and store data. Why not use it?
If you were just going to do some ad-hoc properties, then sure.
But if it's "carte blanche, they can do whatever they want", then let them.
Do they have to know SQL? Umm, no. That's your UI's task. Your job as a tool and application designer is to hide the implementation from the user. So present lists of fields, lines and arrows if you want relationships, etc. Whatever.
Folks have been making "end user", "simple" database tools for years.
"What if they want to add a column?" Then add a column, databases do that, most good ones at least. If not, create the new table, copy the old data, drop the old one.
"What if they want to delete a column?" See above. If yours can't remove columns, then remove it from the logical view of the user so it looks like it's deleted.
"What if they have eleventy zillion rows of data?" Then they have eleventy zillion rows of data and operations take eleventy zillion times longer than if they had 1 row of data. If they have eleventy zillion rows of data, they probably shouldn't be using your system for this anyway.
The fascination of "Implementing databases on databases" eludes me.
"I have Oracle here, how can I offer fewer features and make it slower for the user??"
Gee, I wonder.
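A minimal sketch of what "let the UI drive CREATE TABLE" could look like. It uses Python's built-in `sqlite3` module as a stand-in for SQL Server, and the type whitelist, the identifier rule, and the function names are all hypothetical. Because table and column names cannot be passed as query parameters, the sketch validates them strictly before building the statement.

```python
import re
import sqlite3

# Hypothetical whitelist mapping the field types offered in the UI to column types.
ALLOWED_TYPES = {"text": "TEXT", "integer": "INTEGER", "real": "REAL"}
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,62}")


def _identifier(name):
    """Accept only plain identifiers, since names cannot be bound as parameters."""
    if not IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def _column_type(field_type):
    if field_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported field type: {field_type!r}")
    return ALLOWED_TYPES[field_type]


def create_user_table(conn, table, fields):
    """Create a table from a list of (field_name, field_type) pairs chosen in the UI."""
    columns = ["id INTEGER PRIMARY KEY"]
    for field_name, field_type in fields:
        columns.append(f'"{_identifier(field_name)}" {_column_type(field_type)}')
    conn.execute(f'CREATE TABLE "{_identifier(table)}" ({", ".join(columns)})')


def add_user_column(conn, table, field_name, field_type):
    """'What if they want to add a column?' Then add a column."""
    conn.execute(
        f'ALTER TABLE "{_identifier(table)}" '
        f'ADD COLUMN "{_identifier(field_name)}" {_column_type(field_type)}'
    )


conn = sqlite3.connect(":memory:")
create_user_table(conn, "work_log", [("start_time", "text"), ("project_id", "integer")])
add_user_column(conn, "work_log", "description", "text")
```

The user never sees SQL; they pick field names and types, and the database does the schema work it was built for.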

There's no way you can predict how complex their data requirements will be. Entity-Attribute-Value is one typical solution many programmers use, but it might not be sufficient, for instance if the user's data would conventionally be modeled with multiple tables.
I'd serialize the user's custom data as XML or YAML or JSON or similar semi-structured format, and save it in a text BLOB.
You can even create inverted indexes so you can look up specific values among the attributes in your BLOB. See http://bret.appspot.com/entry/how-friendfeed-uses-mysql (the technique works in any RDBMS, not just MySQL).
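A rough sketch of that idea, assuming JSON as the serialization format and Python's built-in `sqlite3` module as the RDBMS. The `entity` and `entity_index` tables and the `INDEXED_ATTRS` set are hypothetical names, and this is a simplification of the linked technique rather than a reproduction of it.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (
        entity_id INTEGER PRIMARY KEY,
        body      TEXT NOT NULL              -- the user's data, serialized as JSON
    );
    -- Inverted index: (attribute, value) -> entities that contain it.
    CREATE TABLE entity_index (
        attr      TEXT NOT NULL,
        value     TEXT NOT NULL,
        entity_id INTEGER NOT NULL REFERENCES entity(entity_id),
        PRIMARY KEY (attr, value, entity_id)
    );
""")

INDEXED_ATTRS = {"kind", "project"}  # hypothetical: attributes worth looking up


def save_entity(conn, doc):
    """Store the document as a blob and index the chosen attributes."""
    cur = conn.execute("INSERT INTO entity (body) VALUES (?)", (json.dumps(doc),))
    entity_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO entity_index (attr, value, entity_id) VALUES (?, ?, ?)",
        [(attr, str(doc[attr]), entity_id) for attr in INDEXED_ATTRS if attr in doc],
    )
    return entity_id


def find_entities(conn, attr, value):
    """Look up documents through the index instead of scanning every blob."""
    rows = conn.execute(
        """
        SELECT e.body
        FROM entity_index AS i
        JOIN entity AS e ON e.entity_id = i.entity_id
        WHERE i.attr = ? AND i.value = ?
        ORDER BY e.entity_id
        """,
        (attr, str(value)),
    )
    return [json.loads(row[0]) for row in rows]


save_entity(conn, {"kind": "work_log", "project": 7, "hours": 3.5})
save_entity(conn, {"kind": "recipe", "title": "Pancakes", "servings": 4})
```

Each user's documents can have a completely different shape, and only the attributes someone needs to search on get index rows.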
Also consider using a document store such as Solr or MongoDB. These technologies do not need to conform to relational database conventions. You can add new attributes to any document at runtime, without needing to redefine the schema. But it's a tradeoff -- having no schema means your app can't depend on documents/rows being similar throughout the collection.
I'm a critic of the Entity-Attribute-Value anti-pattern.
I've written about EAV problems in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Here's an SO answer where I list some problems with Entity-Attribute-Value: "Product table, many kinds of products, each product has many parameters."
Here's a blog I posted the other day with some more discussion of EAV problems: "EAV FAIL."
And be sure to read this blog "Bad CaRMa" about how attempting to make a fully flexible database nearly destroyed a company.

I would go for a Hybrid Entity-Attribute-Value model, so like Antony's reply, you have EAV tables, but you also have default columns (and class properties) which will always exist.
Here's a great article on what you're in for :)
As an additional comment, I knocked up a prototype for this approach using Linq2Sql in a few days, and it was a workable solution. Given that you've mentioned Entity Framework, I'd take a look at version 4 and their POCO support, since this would be a good way to inject a hybrid EAV model without polluting your EF schema.

On the surface, a schema-less or document-oriented database such as CouchDB or SimpleDB for the custom user data sounds ideal. But I guess that doesn't help much if you can't use anything but SQL and EF.

I'm not familiar with the Entity Framework, but I would lean towards the Entity-Attribute-Value (http://en.wikipedia.org/wiki/Entity-Attribute-Value_model) database model.
So, rather than creating tables and columns on the fly, your app would create attributes (or collections of attributes) and then your end users would complete the values.
But, as I said, I don't know what the Entity Framework is supposed to do for you, and it may not let you take this approach.
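For readers who have not seen it, a bare-bones Entity-Attribute-Value layout might look like the sketch below (Python's built-in `sqlite3`; all table names and sample data are hypothetical). It also shows the cost the critics in this thread point at: every value is stored as text, and reading one logical record means gathering several rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity    (entity_id    INTEGER PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE entity_value (
        entity_id    INTEGER NOT NULL REFERENCES entity(entity_id),
        attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
        value        TEXT,                   -- every value is text, whatever it means
        PRIMARY KEY (entity_id, attribute_id)
    );

    INSERT INTO entity       VALUES (1, 'recipe');
    INSERT INTO attribute    VALUES (1, 'title'), (2, 'servings');
    INSERT INTO entity_value VALUES (1, 1, 'Pancakes'), (1, 2, '4');
""")


def load_entity(conn, entity_id):
    """Reassemble one logical record from its attribute rows."""
    rows = conn.execute(
        """
        SELECT a.name, v.value
        FROM entity_value AS v
        JOIN attribute AS a ON a.attribute_id = v.attribute_id
        WHERE v.entity_id = ?
        """,
        (entity_id,),
    ).fetchall()
    return dict(rows)


print(load_entity(conn, 1))
```

Note that `servings` comes back as the string `'4'`: the database can no longer enforce types or constraints per attribute, which is part of why other answers here call the model an anti-pattern.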

Not as a critical comment, but it may help save some of your time to point out that this is one of those Don Quixote Holy Grail type issues. There's an eternal quest for probably over 50 years to make a user-friendly database design interface.
The only quasi-successful ones that have gained any significant traction that I can think of are 1. Excel (and its predecessors), 2. Filemaker (the original, not its current flavor), and 3. (possibly, but doubtfully) Access. Note that the first two are limited to basically one table.
I'd be surprised if our collective conventional wisdom is going to help you break the barrier. But it would be wonderful.

Rather than re-implement SQL Server's CREATE TABLE statement, which was written many years ago by a team of programmers who were probably better than you or I, why not expose SQL Server to the users in a limited way? Let them create their own schema within those limits and leverage the power of SQL Server to do it properly.

I would just give them a copy of SQL Server Management Studio, and say, "go nuts!" Why reinvent a wheel within a wheel?

Check out this post: you can do it, but it's a lot of hard work :) If performance is not a concern, an XML solution could work too, though that is also a lot of work.

What are the general guidelines and best practices to keep in mind while designing database for an application? [closed]

My question is about database modeling. I looked for this question among other database design questions on SO but haven't found it, so I am asking here:
What are the general guidelines and best practices to keep in mind while designing database for an application ?
What are the best resources/books/University Lectures available on Database Design Concepts ?
Thanks.

Just some things I've learned from experience (I'm sure some will disagree, but I've been querying, designing, and programming databases for 30+ years and have seen the effects of stupid design up close and personal):
There are three critical things to consider in all database design - data integrity (without this you essentially have no data), security and performance. All other considerations take a back seat to these three.
Never create a table without a way to uniquely identify a record.
There really are very few true natural keys that work as a primary key. If you don't have control over whether a value will change, do not use it as a primary key (you don't really want to change the company name through 27 child tables, do you?). Use a surrogate key instead. Using a surrogate key does not exempt you from the need to set unique indexes if you could have used a unique composite key. Always set these indexes if you can determine a way to have a unique composite. Duplicate records are the bane of an application's existence. It seems obvious, but never ever consider name to be a key field; names are not and never will be unique.
Do not use a GUID as your primary key as it can kill performance. If you need a GUID for replication, also consider having an int or bigint primary key.
Do not design as if you will be changing database backends unless you know up front you will be doing so. Virtually all the good techniques for performance tuning are database specific; don't harm your own ability to tune your database for a non-existent requirement.
Avoid value-entity table structures. They are miserable to query.
Add all things you need to ensure data integrity into your database design, things like defaults, constraints, triggers, etc. are necessary to avoid having useless data. Do not rely on the application code to do this or you will be sorry.
Others mentioned normalization, I agree you must understand this thoroughly even if you later decide to denormalize.
Do not stack views on top of views if you want any kind of performance at all. Every database I've seen that does this is eventually a huge performance problem.
Consider what data you will need to manage the database as well as what the application needs. If you are going to be serious about databases you need to understand database auditing and your database should implement ways to find out who made what change and when and what the old data was. You'll thank me the first time someone malicious changes the data or someone deletes all the records in a table accidentally.
Really think through how the data will be queried when designing. It can make a huge difference in the design.
Do not store more than one piece of information in a field. It might look cool to put a comma delimited list into one field rather than add a related table but it is a really bad idea.
Elegance is often the enemy of performance in databases. Pick performance over elegance every time and you won't go wrong.
Avoid the use of database keywords in the naming of objects. Your programmers will thank you. Pick a naming convention and be consistent in always using it. If a field is in multiple tables, make sure it has the same name, the same datatype, and the same length (if applicable) in all of them. The exception is when an id field has two possible foreign keys in the same table: then use the id field name with a prefix to identify the difference between, say, Sales_person_id and Customer_person_id. Fix misspellings right away; you really don't want to spend the next ten years remembering that in tablea it is persnoid instead of personid.
Read about database refactoring (search on amazon for some good books) and consider how to be able to do this in your design. Few databases are designed to be refactored and being able to do so is critical towards being able to fix database problems that arise from badly thought out designs or changes to business requirements.
While you are reading, read about performance tuning, you'll learn a tremendous amount about what to avoid in designing the database.
I'm sure there's more but this is enough to start with.
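Several of the points in this list (a surrogate key backed by a unique composite index, constraints living in the database, and an audit trail of who changed what) can be sketched together. The example uses Python's built-in `sqlite3` module; the `product` table, the vendor columns, and the `updated_by` convention are hypothetical, and a production audit design would usually record more than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id  INTEGER PRIMARY KEY,                      -- surrogate key
        vendor_code TEXT NOT NULL,
        vendor_sku  TEXT NOT NULL,
        price_cents INTEGER NOT NULL CHECK (price_cents >= 0),
        updated_by  TEXT NOT NULL,
        UNIQUE (vendor_code, vendor_sku)                      -- composite stays unique
    );
    CREATE TABLE product_audit (
        audit_id        INTEGER PRIMARY KEY,
        product_id      INTEGER NOT NULL,
        old_price_cents INTEGER NOT NULL,
        new_price_cents INTEGER NOT NULL,
        changed_by      TEXT NOT NULL,
        changed_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- The database records every price change, whatever application made it.
    CREATE TRIGGER product_price_audit
    AFTER UPDATE OF price_cents ON product
    FOR EACH ROW
    BEGIN
        INSERT INTO product_audit (product_id, old_price_cents, new_price_cents, changed_by)
        VALUES (OLD.product_id, OLD.price_cents, NEW.price_cents, NEW.updated_by);
    END;
""")
conn.execute(
    "INSERT INTO product (vendor_code, vendor_sku, price_cents, updated_by) "
    "VALUES ('ACME', 'A-1', 1000, 'alice')"
)
conn.execute(
    "UPDATE product SET price_cents = 1200, updated_by = 'bob' WHERE product_id = 1"
)


def is_rejected(conn, sql, params=()):
    """Return True if the database itself refuses the statement."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False
```

A duplicate `('ACME', 'A-1')` row or a negative price is refused by the schema, so no application or bulk load can slip bad data past it.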
One additional thing I wanted to add. Do not design your database as if the data entry application page is the most critical thing. Data is often queried more often than it is written, even in a transactional database. Really think about how easy it will be to get data back out of the database (Oh, so that's why the EAV model is so bad!) and what effect the design will have on reporting. This is especially critical as I often see that the people doing the reporting are not the people who design the database, or that reporting tasks come later in the project than creating the data entry. Databases are not easy to refactor, so consider the whole life cycle of the data when designing a database. Think about things like storing moment-in-time values: two years later you can't find out how much an order was for by multiplying the quantity ordered by the price in the products table, because that wasn't the price at the time of the order. Reporting needs this type of information, but it is often too late to get it by the time the reports are written when the design is done badly. Stuff that works fine when you are handling one record at a time can be a disaster when you need to look at thousands or millions of records. Not every application is going to create a separate reporting database, so really think about this.
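The moment-in-time point can be sketched in a few lines. This uses Python's built-in `sqlite3` module; the table and column names are hypothetical. The order line copies the price at the time of the order instead of relying on the current price in the product table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id  INTEGER PRIMARY KEY,
        price_cents INTEGER NOT NULL            -- current price, changes over time
    );
    CREATE TABLE order_line (
        order_line_id    INTEGER PRIMARY KEY,
        product_id       INTEGER NOT NULL REFERENCES product(product_id),
        quantity         INTEGER NOT NULL,
        unit_price_cents INTEGER NOT NULL       -- price as it was when ordered
    );
    INSERT INTO product VALUES (1, 500);
""")


def place_order_line(conn, product_id, quantity):
    """Copy the current price into the order line at order time."""
    conn.execute(
        """
        INSERT INTO order_line (product_id, quantity, unit_price_cents)
        SELECT product_id, ?, price_cents FROM product WHERE product_id = ?
        """,
        (quantity, product_id),
    )


def order_total_cents(conn):
    """Total based on the stored order-time prices, not on today's prices."""
    return conn.execute(
        "SELECT COALESCE(SUM(quantity * unit_price_cents), 0) FROM order_line"
    ).fetchone()[0]


place_order_line(conn, 1, 3)
conn.execute("UPDATE product SET price_cents = 900 WHERE product_id = 1")
print(order_total_cents(conn))
```

After the price change, a report that multiplied quantity by `product.price_cents` would claim 2700 cents for an order that actually cost 1500.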

DEPENDS
This question is like asking "what is the best car to buy": it really depends on many factors, including the amount of data, the number of concurrent users, what you are trying to do, etc. FYI, normalization is good for some database uses, but bad for others (data warehouses).
Give us a better idea of how you intend to use the data, and you'll get some better recommendations.

While I agree with others that your question right now is much too broad and can't really be answered (except for the "it depends" approach :-)), there is one book I would wholeheartedly recommend for anyone beginning database design in general:
Michael Hernandez: Database Design for Mere Mortals(R): A Hands-On Guide to Relational Database Design
It's a really hands-on, no-frills, down to earth book and introduces all the major and important concepts in a very understandable, very approachable fashion. Well written, interesting, very sound and useful - highly recommended!
Marc

Your question is too broad. Normalization and denormalization are the most commonly used concepts.

The best thing to do is to start with a well normalized database. The wikipedia article has some good information on that along with some good references.
Typically you'll end up denormalizing parts of your database for better performance, but you almost always want to start with it in 4th normal form.

Look at the Wikipedia article on database normalization. It also has a further reading section.
If you are designing a new database for a brand-new application, you should try using an ORM library (like the JPA implementations in Java) that relieves you of database design, because these tools generate the database from the domain model. If you don't have any experience in this field, a database generated with ORM tools will be much better than yours.

Consider all your use cases. Think about every single possible way someone might want to get to data, and plan for those. Wear your designer, developer, tester, and user hats.
Try to think of database tables as representing physical objects.
Normalize, as others have said.

Pros/cons of document-based databases vs. relational databases

I've been trying to see if I can accomplish some requirements with a document based database, in this case CouchDB. Two generic requirements:
CRUD of entities with some fields which have unique index on it
ecommerce web app like eBay (better description here).
And I'm beginning to think that a document-based database isn't the best choice to address these requirements. Furthermore, I can't imagine a use for a document-based database (maybe my imagination is too limited).
Can you explain to me if I am asking pears from an elm when I try to use a Document oriented database for these requirements?

You need to think of how you approach the application in a document oriented way. If you simply try to replicate how you would model the problem in an RDBMS then you will fail. There are also different trade-offs that you might want to make. ([ed: not sure how this ties into the argument but:] Remember that CouchDB's design assumes you will have an active cluster of many nodes that could fail at any time. How is your app going to handle one of the database nodes disappearing from under it?)
One way to think about it is to imagine you didn't have any computers, just paper documents. How would you create an efficient business process using bits of paper being passed around? How can you avoid bottlenecks? What if something goes wrong?
Another angle you should think about is eventual consistency, where you will get into a consistent state eventually, but you may be inconsistent for some period of time. This is anathema in RDBMS land, but extremely common in the real world. The canonical transaction example is of transferring money between bank accounts. How does this actually happen in the real world: through a single atomic transaction, or through different banks issuing credit and debit notices to each other? What happens when you write a cheque?
So let's look at your examples:
CRUD of entities with some fields with unique index on it.
If I understand this correctly in CouchDB terms, you want to have a collection of documents where some named value is guaranteed to be unique across all those documents? That case isn't generally supportable because documents may be created on different replicas.
So we need to look at the real world problem and see if we can model that. Do you really need them to be unique? Can your application handle multiple docs with the same value? Do you need to assign a unique identifier? Can you do that deterministically? A common scenario where this is required is where you need a unique sequential identifier. This is tough to solve in a replicated environment. In fact if the unique id is required to be strictly sequential with respect to time created it's impossible if you need the id straight away. You need to relax at least one of those constraints.
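One way to "do that deterministically" is to derive the document id from the value that has to be unique, so that every replica computes the same id without coordinating. The sketch below uses name-based UUIDs from Python's standard `uuid` module; the namespace, the `kind:key` format, and the function name are assumptions for illustration, not anything CouchDB prescribes.

```python
import uuid

# Hypothetical application namespace. Any fixed UUID works, as long as
# every replica uses the same one.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "shop.example")


def document_id(kind, natural_key):
    """Derive the same id on every replica from the document's natural key."""
    return str(uuid.uuid5(APP_NAMESPACE, f"{kind}:{natural_key}"))


# Two replicas creating a document for the same e-mail address compute the
# same id, so the duplicate shows up as a conflict on one document instead
# of as two unrelated documents.
replica_a = document_id("user", "ada@example.com")
replica_b = document_id("user", "ada@example.com")
print(replica_a == replica_b)
```

This relaxes the "sequential" constraint rather than the "straight away" one: the id is available immediately, but it carries no ordering information.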
ecommerce web app like eBay
I'm not sure what to add here as the last comment you made on that post was to say "very useful! thanks". Was there something missing from the approach outlined there that is still causing you a problem? I thought MrKurt's answer was pretty full and I added a little enhancement that would reduce contention.

Is there a need to normalize the data?
Yes: Use relational.
No: Use document.

I am in the same boat. I am loving CouchDB at the moment, and I think that the whole functional style is great. But when exactly do we start to use them in earnest for applications? I mean, yes, we can all start to develop applications extremely quickly and cruft-free, with all those nasty hang-ups about normal form left by the wayside and no schemas. But, to coin a phrase, "we are standing on the shoulders of giants". There is a good reason to use an RDBMS, to normalise, and to use schemas. My old Oracle head is reeling thinking about data without form.
My main wow factor on couchdb is the replication stuff and the versioning system working in tandem.
I have been racking my brain for the last month trying to grok the storage mechanisms of CouchDB. Apparently it uses B-trees but doesn't store data based on normal form. Does this mean that it is really, really smart and realises that bits of data are replicated, so let's just make a pointer to this B-tree entry?
So far I am thinking of xml documents, config files, resource files streamed to base64 strings.
But would I use couchdb for structural data. I don't know, any help greatly appreciated on this.
Might be useful in storing RDF data or even free form text.

A possibility is to have a main relational database that stores definitions of items that can be retrieved by their IDs, and a document database for the descriptions and/or specifications of those items. For example, you could have a relational database with a Products table with the following fields:
ProductID
Description
UnitPrice
LotSize
Specifications
And that Specifications field would actually contain a reference to a document with the technical specifications of the product. This way, you have the best of both worlds.
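A toy sketch of that split, assuming Python's built-in `sqlite3` module for the relational side and a plain dictionary standing in for the document database. The document keys, the sample product, and the choice to store `UnitPrice` in cents are hypothetical.

```python
import sqlite3

# Stand-in for a document database: document key -> free-form specification.
spec_documents = {
    "spec-1001": {"weight_kg": 1.2, "colors": ["red", "blue"], "voltage": "230V"},
}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Products (
        ProductID      INTEGER PRIMARY KEY,
        Description    TEXT NOT NULL,
        UnitPrice      INTEGER NOT NULL,     -- in cents (an assumption)
        LotSize        INTEGER NOT NULL,
        Specifications TEXT                  -- key of the specification document
    )
""")
conn.execute("INSERT INTO Products VALUES (1001, 'Desk lamp', 2499, 10, 'spec-1001')")


def product_with_specs(conn, product_id):
    """Fetch the relational row, then resolve the document reference."""
    row = conn.execute(
        "SELECT Description, UnitPrice, LotSize, Specifications "
        "FROM Products WHERE ProductID = ?",
        (product_id,),
    ).fetchone()
    if row is None:
        return None
    description, unit_price, lot_size, spec_key = row
    return {
        "description": description,
        "unit_price_cents": unit_price,
        "lot_size": lot_size,
        "specifications": spec_documents.get(spec_key),
    }
```

The relational side keeps the fields every product shares and can be constrained and joined; the document side absorbs specifications whose shape differs from product to product.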

Document-based DBs are best suited for storing, well, documents. Lotus Notes is a common implementation and Notes email is an example. For what you are describing (eCommerce, CRUD, etc.), relational DBs are better designed for storage and retrieval of data items/elements that are indexed (as opposed to documents).

Re CRUD: the whole REST paradigm maps directly to CRUD (or vice versa). So if you know that you can model your requirements with resources (identifiable via URIs) and a basic set of operations (namely CRUD), you may be very near to a REST-based system, which quite a few document-oriented systems provide out of the box.
