I normally design databases with an entity–relationship model in Visio and then derive the relational model. But now I have to design a DB with a lot of entities and relationships, which ends up looking very confusing in Visio. So my question is: are there any programs which make it easy to display large DB models, or are there any design approaches for large DBs?
Check this post: http://www.datasciencecentral.com/profiles/blogs/top-6-data-modeling-tools
Usually, you don't have to visualize all the entities and relationships in a single diagram. You partition your entities into smaller concepts/modules, etc. The repository (of the tool you use, which I believe includes Visio) holds all the entities you have defined, and a selected subset of them is visualized in each diagram.
I've been learning about data warehousing concepts and I found these two topics a little confusing. I've read multiple blog posts and understood that data modelling consists of three steps:
Conceptual Data Model
Logical Data Model
Physical Data Model
and in data warehousing we need to perform certain steps:
Step 1: Identify the dimensions
Step 2: Identify the measures
Step 3: Identify the attributes or properties of dimensions
Step 4: Identify the granularity of the measures
Are these modelling techniques related to each other? If yes, how are they related?
If someone asks how to design a data warehouse, what would the correct answer be? Where do these modelling techniques come in while designing a data warehouse?
It would be really helpful if someone could point me to a link/resource about data modelling and dimensional modelling scenarios.
As the name suggests, a conceptual model is very high level and does not correspond directly to what actually gets built. Logical and physical models do correspond to what you are actually going to build; the difference between the two is that a logical model is system-independent while a physical model is tied to the platform/DB where it is going to be deployed. However, they are fundamentally similar, to the point that most modelling tools can automatically generate a physical model from a logical one (and vice versa).
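As a small illustration (a hypothetical Customer entity, not from any particular method): the logical model states platform-independent facts such as "Customer has a required textual Name", while each physical model commits to one platform's types and features:

-- Physical model for SQL Server:
CREATE TABLE Customer (
    CustomerID  INT IDENTITY(1,1) PRIMARY KEY,
    Name        NVARCHAR(100) NOT NULL,
    DateOfBirth DATE NULL
);

-- The same logical entity realised physically on PostgreSQL:
CREATE TABLE customer (
    customer_id   SERIAL PRIMARY KEY,
    name          VARCHAR(100) NOT NULL,
    date_of_birth DATE
);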
A dimensional model is a type of logical/physical model, in the same way that OLTP, Inmon, Data Vault, etc. are types of logical/physical model. There are normally best practices defined for the steps required to design each of these model types - and you have listed the steps specific to designing a Dimensional model.
So for a given data domain (e.g. a Sales organisation), you would normally have a single Conceptual model and then multiple logical/physical models. Usually these would be one transactional model and one analytical model; the transactional model could be OLTP or NoSQL (or whatever suits your requirements/technology the best); the analytical model could be Dimensional, Inmon, Graph, etc. - again whatever suits your data/analytical requirements the best.
I know there is no easy answer to this question, but how do I clean up a database with no relationships, no foreign keys, and not a whole lot of structure?
I'm an amateur at SQL, and I've inherited a database that is a complete mess. We have no referential integrity of any sort, and there's not a whole lot of logic to how the tables work.
My database is all data that comes from a warehouse that builds servers.
To give you an idea of the type of data I'm working with:
EDI from customers
Raw output from server projects
Sales information
Site information
Parts lists
I have been prioritizing Raw output and EDI information, and generating reports with that information using SSRS. I have learned a lot about SQL Server and the BI Microsoft tools (SSIS and SSRS) in my short time doing this. However, I'm still an amateur and I want to build a solid database that flows well and can stand on its own.
It seems like a data warehouse model is the type of structure I should adopt.
My question is: how do I take my mess of a database and make it into something more organized before I drown in data?
Since your end goal appears to be business reporting, and you're dealing with data from multiple sources made up of "isolated" tables, I would advise you to start by aggregating all of that into a data model.
Personally, I would design a dimensional model to structure and store all that data, with the goal of being easy to understand (for reporting or ad hoc querying). The model should be focused on business entities and their transactions. In a dimensional model, the business entities will (almost always) be the dimensions and the transactions (the metrics) will be the facts. For example, without knowing your model I'm guessing that the immediate entities would include Customer, Site and Part, and the transactions would include ServerSale, SiteVisit, PartPurchase, PartRepair, PartOrder, etc.
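A minimal star-schema sketch of that idea in T-SQL (all names are the guesses above, not your actual model):

CREATE TABLE DimCustomer (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName VARCHAR(100) NOT NULL,
    Region       VARCHAR(50) NULL       -- descriptive attributes you slice reports by
);

CREATE TABLE DimPart (
    PartKey     INT IDENTITY(1,1) PRIMARY KEY,
    PartNumber  VARCHAR(30) NOT NULL,
    Description VARCHAR(200) NULL
);

-- One fact table per business transaction, at an explicitly chosen grain:
-- here, one row per part per server sale.
CREATE TABLE FactServerSale (
    CustomerKey INT NOT NULL REFERENCES DimCustomer (CustomerKey),
    PartKey     INT NOT NULL REFERENCES DimPart (PartKey),
    SaleDate    DATE NOT NULL,
    Quantity    INT NOT NULL,
    SaleAmount  DECIMAL(12, 2) NOT NULL -- the measures your reports aggregate
);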
More information about dimensional modelling here and here, but I suggest going straight to the source: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
Once your model is designed (and implemented in a database like SQL Server), you'll load data into it by extracting it from the different source systems/databases and transforming it from its current structure into the structure defined by the model, typically using an ETL tool like MS Integration Services. For example, your customer data may be scattered across the "sales", "customer" and "site" tables, so you would aggregate all that data and load it into a single Customer dimension table. It's during this ETL that you should check your data for the problems you already mentioned, loading correct rows into your data model and discarding incorrect rows into a file/log where they can later be checked and corrected (there are multiple ways to address this).
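In plain SQL terms (the same thing an SSIS data flow would do), the customer consolidation step might look like this rough sketch; the source table and column names are invented for illustration:

-- Consolidate customer data scattered across hypothetical source tables
-- into the single Customer dimension:
INSERT INTO DimCustomer (CustomerName, Region)
SELECT DISTINCT c.CustomerName, s.Region
FROM SourceCustomer AS c
LEFT JOIN SourceSite AS s ON s.CustomerID = c.CustomerID
WHERE c.CustomerName IS NOT NULL;  -- a simple quality rule; in SSIS, failing rows
                                   -- would be redirected to an error output/log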
A straightforward tutorial to get started on doing ETL using SSIS can be found at https://technet.microsoft.com/en-us/library/jj720568(v=sql.110).aspx
So, to sum up, you should build a data mart:
design a dimensional model that represents the business facts and context in the data you have. This will strongly facilitate both data understanding and reporting, because a dimensional model closely matches business users' terminology and mental models.
use an ETL tool to extract the data from its current sources, process it (e.g. check for data quality problems, join data from different sources) and load it into the dimensional model. This will get you close to an automated data integration job/pipeline with whatever quality checks you deem fit for the data.
I want to build a reviews-like website, but not only with reviews; other types of content as well. The design of the website combines a hierarchical structure (each content object/record/entity has a parent, a kind of container) with relations: each content object/record/entity has a number of related objects:
an author of the content (i.e. user)
related comments (with their own relations, particularly authors)
item being reviewed as a separate record in DB
images from the gallery
One of the most important things is performance. Relations have proven inefficient in NoSQL, as I've read on the net and have experienced in other projects. On the other hand, the general design, apart from the relations mentioned, has an obvious content-repository-like structure, an exact reflection of the hierarchical arrangement of objects (documents, articles, reviews) around which websites are designed. Also, I really like the loose structure of records in NoSQL. Yet I don't care about (nor use) things like versioning and other NoSQL-related features.
So I want to combine both worlds, hierarchical and relational, within one project, or actually within its model. Apart from that, I want the project to be RESTful, so that mobile apps can use the same content through an API. Another requirement is that the content should be searchable.
What type of storage would you choose for a project like this?
I decided to go with the Graph DBs. Here's why I rejected the other ones:
I don't want to use NoSQL (documents), since relations are hard to maintain and often require extra (often custom) code infrastructure to handle them; see e.g. Diaspora's NoSQL problems
I don't want to use an RDBMS, since schema-based DBs impose well-known limitations and don't reflect the domain
I rejected key-value and big-table DBs, as they have very specific use cases
Graph databases have been used in a number of content-oriented projects and appeared to do the job surprisingly well.
You can easily model a hierarchical data structure in SQL with the following (using PostgreSQL):
CREATE TABLE comments (
    id      INTEGER PRIMARY KEY,
    parent  INTEGER REFERENCES comments (id),
    content VARCHAR(1024)
);
Where parent refers to the id of the parent comment.
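To retrieve an entire comment thread you can then use a recursive CTE; a minimal sketch, assuming a hypothetical root comment with id 42:

WITH RECURSIVE thread AS (
    -- anchor: the root comment of the thread
    SELECT id, parent, content, 0 AS depth
    FROM comments
    WHERE id = 42
    UNION ALL
    -- recursive step: pull in each comment whose parent is already in the set
    SELECT c.id, c.parent, c.content, t.depth + 1
    FROM comments AS c
    JOIN thread AS t ON c.parent = t.id
)
SELECT id, parent, content, depth
FROM thread
ORDER BY depth;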
If you are after a NoSQL database that exposes a RESTful interface, you could consider CouchDB.
You can then replicate CouchDB to Elasticsearch for more robust searching.
But if your data is relational then I would very much recommend you consider a SQL database like PostgreSQL first.
Given a problem specification, how can you tell whether it is a database design problem or a class design (object-oriented design) problem?
What comes to mind is that in OOP, classes (objects) contain methods, whereas a database is just a collection of relationships and values.
Therefore:
If you can say a problem is about how "things" in the specification relate to each other, you have a database design problem.
If it is about what the "things" in the specification can do, you're going to be modeling more along the lines of object-oriented programming.
If you're using a database and creating domain objects, it's both. Database design and class design are two different things, and both are necessary if you're using a database and classes. It's not like you choose one or the other.
This is where an ORM comes into play. When your data layer retrieves information from the database, a typical approach is to transform the relational data into your domain object(s) and pass that to the business logic layer so the rest of your application can deal with domain objects instead of a relational model.
Then your ORM does the opposite when persisting data: it takes a domain entity and turns it back into a relational structure that can be saved to the database.
Note: I'm assuming a relational database here. If not, substitute "relational" with whatever type of persistence layer you're using.
I believe that the only specifications which should be addressed as database-oriented problems are those focused on the manipulation of structured data types. If your specification is all about "store a customer record", "delete an order record", "change the value of price from 12 to 33 for records matching a specification", you've got a database project.
I haven't seen that kind of problem specification since the Cobol team I worked in employed a systems ~~anarchist~~ analyst. Almost every project I've worked on since has had requirements that were not about how data was stored, but what the data meant.
If you get a requirement that says "Users may create Customers. Customers can place orders. Orders contain products. Orders can have delivery methods, payment methods, and status. Status follows a business process", you have an OO problem. You probably need a storage mechanism - and a database would be an excellent choice - but you have business logic that cannot be exclusively implemented by creating structured data types and relationships.
I love the flexible schema capabilities of CouchDB and MongoDB, but I also love the relational 'join' capability of SQL Server. What I really want is the ability to have tables such as PERSON, COMPANY and ORDER that are basically 'open schema', where each table has an ID but the rest of the columns are defined JSON-style, e.g. {ID:12,firstname:"Pete",surname:"smith",height:"180"}, but where I can efficiently join PERSON to COMPANY, either directly or via a many-to-many xref table. Does anyone know whether SQL Server has any plans to incorporate 'open schema' in SQL, or whether Mongo or Couch have plans to support efficient joins? Thanks very much.
CouchDB offers a number of ways to establish relationships between your various documents/entities. Check out this article on the wiki to get started.
The tendency, when coming from a relational background, is to continue using the same terminology and mindset whenever you try to solve problems. It's very important to understand that NoSQL solutions are very different, otherwise they have no real purpose for existing. You should really seek to understand how these various NoSQL solutions work so you can compare them with your application's requirements to see if it's an appropriate fit.
MongoDB = NoSQL = No Joins - never ever.
If you need JOINs due to your data model or project requirements: stay with an RDBMS.
Alternatives in MongoDB:
denormalization
using embedded documents
multiple queries
As much as this would be inefficient to query on a large scale, from a technical standpoint using the XML datatype would allow you to store whatever structure you want, varying by row.
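A rough sketch of that approach in SQL Server (table and element names invented for illustration):

CREATE TABLE Person (
    ID   INT IDENTITY(1,1) PRIMARY KEY,
    Data XML NOT NULL                  -- structure can vary freely per row
);

INSERT INTO Person (Data)
VALUES ('<person><firstname>Pete</firstname><height>180</height></person>');

-- Values are extracted with the xml type's value() method; this per-row
-- shredding is exactly where the large-scale query cost comes from.
SELECT ID,
       Data.value('(/person/firstname)[1]', 'VARCHAR(50)') AS FirstName
FROM Person;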
Not that I'm aware of, but it's not that hard to roll your own EAV; it's only 3 tables after all :) (see the sketch below the list)
Entity stores the associated table name.
Attribute stores the column name, data type and whether it's nullable.
Value contains one nullable column for each required data type.
Entity 1..* Attribute 1..* Values
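A minimal DDL sketch of those three tables (T-SQL; the names and the set of typed value columns are assumed for illustration):

CREATE TABLE Entity (
    EntityID  INT IDENTITY(1,1) PRIMARY KEY,
    TableName VARCHAR(128) NOT NULL    -- the associated table name
);

CREATE TABLE Attribute (
    AttributeID INT IDENTITY(1,1) PRIMARY KEY,
    EntityID    INT NOT NULL REFERENCES Entity (EntityID),
    ColumnName  VARCHAR(128) NOT NULL,
    DataType    VARCHAR(30) NOT NULL,
    IsNullable  BIT NOT NULL
);

CREATE TABLE AttributeValue (
    ValueID     INT IDENTITY(1,1) PRIMARY KEY,
    AttributeID INT NOT NULL REFERENCES Attribute (AttributeID),
    IntValue    INT NULL,              -- one nullable column per supported data type
    TextValue   VARCHAR(4000) NULL,
    DateValue   DATETIME NULL
);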
Assuming you're using .NET, define your EAV interfaces, create some POCOs, and let Entity Framework or your ORM of choice wire up the associations for you. LINQ works great for this sort of operation.
It also allows you to create a hybrid model, where parts of the schema are known but you still want flexibility for custom data. If you design your domain model with this in mind (i.e. use the EAV interfaces in your model), the EAV can be baked into the EF data context (or whatever) to automate the loading of attributes and their values for each entity. Your EF entity just needs to know which table entity it belongs to.
Of course it's not the perfect solution, as you're (potentially) trading performance for flexibility. Depending on the amount of data you want to persist and the performance requirements, it may be more suited to models where most of the schema is known and a smaller percentage is unknown. YMMV.