I have seen a lot of topics asking which database to choose for a voting mechanism, but my situation is a bit different. I have an application with a GUI that can contain multiple fields, radio buttons, or a combination of the two. The GUI is not fixed. Based on the form submitted, an answer XML document is generated dynamically.
Thus, for any given form, there can be 10,000 different people submitting it, and I will have 10,000 different forms (and these numbers will grow).
I now have two options: store every XML document as-is in the database (I have not yet chosen between a relational DB and a NoSQL DB like MongoDB), or parse the XML and create tables for every form, in which case the number of tables will be huge.
Now I have to build a voting mechanism that looks at all the XML documents generated for a particular form (i.e. 10,000 of them), extracts the answers submitted (note: the XML is complex because one form can have multiple answer elements), and then tallies how many people gave the same answer.
My Questions:
Should I use a relational DB or NoSQL (MongoDB, Redis, or similar)?
Should I save the XML documents as-is in the DB, or should I parse them into tables and save those? Is there any other approach I could follow?
I am using Java/J2EE for development currently.
If your question is about how to store data with a variable structure, then a document database would be pretty handy. Since it is schema-less, there is no RDBMS column maintenance to worry about.
Logically this approach is pretty similar to storing XML in a relational DB. The difference is that with the RDBMS approach, every database reader needs its own XML parsing layer. (On the XML question, see also Why would I ever choose to store and manipulate XML in a relational database?.)
In general, if you're planning to have a single database client, XML in an RDBMS can work fine.
By the way, instead of storing XML, you can use an RDBMS in another way: define a "generic" structure. For example, you can have an "Entities (name, type, id)" table and an "Attributes (entityId, name, type, value)" table.
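For illustration, a minimal SQL sketch of that generic structure might look like this (table and column names follow the answer; the exact types and keys are my assumptions):

-- One row per stored entity (e.g. one submitted form).
CREATE TABLE entities (
    id   bigint       PRIMARY KEY,
    name varchar(100) NOT NULL,
    type varchar(50)  NOT NULL
);

-- One row per answer/attribute belonging to that entity.
CREATE TABLE attributes (
    entity_id bigint       NOT NULL REFERENCES entities(id),
    name      varchar(100) NOT NULL,
    type      varchar(50)  NOT NULL,
    value     text,
    PRIMARY KEY (entity_id, name)  -- assumes one value per attribute name; relax if attributes can repeat
);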
If you store XML in the DB, you trade performance and maintainability for flexibility (XML parsing with XPath etc. can be verbose and error-prone, especially with complex, deeply nested XML structures).
If you create tables for each XML form, you gain performance and ease of use at the cost of flexibility and added complexity (a large number of tables).
Pick a hybrid approach: store the XML documents in an RDBMS table as a generic XML structure (as suggested in one of the answers). This way you have fewer tables (less complexity) and avoid the performance issues of XML parsing.
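As an illustrative sketch of that hybrid layout (PostgreSQL syntax; table, column, and element names here are made up, not from the question):

-- One row per submitted form; the raw answer document is kept in an xml column.
CREATE TABLE form_submission (
    submission_id bigserial PRIMARY KEY,
    form_id       bigint    NOT NULL,
    submitted_at  timestamp NOT NULL DEFAULT now(),
    answer_xml    xml       NOT NULL
);

-- The vote tally for one question of one form can then be done with the
-- database's XML functions, for example:
SELECT (xpath('/answers/question1/text()', answer_xml))[1]::text AS answer,
       count(*)                                                  AS votes
FROM   form_submission
WHERE  form_id = 123
GROUP  BY 1
ORDER  BY votes DESC;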
Suppose you want to store "tags" on an object (say, a post). With PostgreSQL 9.4 you have three main choices:
tags as text[]
tags as jsonb
tags as text (and you store a JSON string as text)
In many cases the third would be out of the question, since it wouldn't allow queries conditioned on the tags' values. In my current project I don't need such queries; tags are only there to be shown in the post list, not to filter posts.
So the choice is mostly between text[] and jsonb. Both can be queried.
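For instance (a hedged sketch assuming a post table with a tags column; names are illustrative), containment queries look like this for each type, and both can be backed by a GIN index:

-- tags as text[]: posts tagged 'postgres'
SELECT id FROM post WHERE tags @> ARRAY['postgres'];

-- tags as jsonb (stored as a JSON array of strings): same lookup
SELECT id FROM post WHERE tags @> '["postgres"]'::jsonb;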
What would you use? And why?
In most cases I would use a normalized schema with a table option_tag implementing the many-to-many relationship between the tables option and tag. Reference implementation here:
How to implement a many-to-many relationship in PostgreSQL?
It may not be the fastest option in every respect, but it offers the full range of DB functionality, including referential integrity, constraints, the full range of data types, all index options and cheap updates.
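A minimal sketch of that normalized layout (names follow the answer; types and constraints are my assumptions):

CREATE TABLE tag (
    tag_id serial PRIMARY KEY,
    tag    text   NOT NULL UNIQUE
);

CREATE TABLE option (
    option_id serial PRIMARY KEY,
    option    text   NOT NULL
);

-- Link table implementing the many-to-many relationship.
CREATE TABLE option_tag (
    option_id int NOT NULL REFERENCES option(option_id),
    tag_id    int NOT NULL REFERENCES tag(tag_id),
    PRIMARY KEY (option_id, tag_id)
);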
For completeness, add to your list of options:
hstore (good option)
xml (more verbose and more complex than either hstore or jsonb, so I would only use it when already operating with XML)
"string of comma-separated values" (very simple, mostly bad option)
EAV (Entity-Attribute-Value) or "name-value pairs" (mostly bad option)
Details under this related question on dba.SE:
Is there a name for this database structure?
If the list is just for display and rarely updated, I would consider a plain array, which is typically smaller and performs better for this than the rest.
Read the blog entry by Josh Berkus that @a_horse linked to in his comment. But be aware that it focuses on selected read cases. Josh concedes:
I realize that I did not test comparative write speeds.
And that's where the normalized approach wins big, especially when you change single tags a lot under concurrent load.
jsonb is a good option if you are going to operate with JSON anyway, and can store and retrieve JSON "as is".
I have used both a normalized schema and a plain text field with comma-separated values instead of custom data types (instead of CSV you can use JSON or some other encoding, like www-urlencoding or even XML attribute encoding). This is because many ORMs and database libraries are not very good at supporting custom data types (hstore, jsonb, array, etc.).
@ErwinBrandstetter missed a couple of other benefits of the normalized approach, one being that it is much quicker to query for all previously used tags in a normalized schema than with the array option. This is a very common scenario in many tagging systems.
That being said, I would recommend using Solr (or Elasticsearch) for querying tags, as it handles tag counts and general tag-prefix searching far better than anything I could get Postgres to do, provided you're willing to deal with the consistency aspects of synchronizing with a search engine. In that case the storage format of the tags becomes less important.
We have a table in our database that stores XML in one of its columns. The XML always matches one of a set of three different formats, received via web service responses. We need to look up information in this table (including inside the XML field) very frequently. Is this a poor use of the XML datatype?
My suggestion is to create separate tables for each XML structure, as we are only talking about three, with a growth rate of maybe one new table a year.
I suppose ultimately this is a matter of preference, but here are some reasons I prefer not to store data like that in an XML field:
Writing queries against XML in TSQL is slow. Might not be too bad for a small amount of data, but you'll definitely notice it with a decent amount of data.
Sometimes there is special logic needed to work with an XML blob. If you store the XML directly in SQL, then you find yourself duplicating that logic all over. I've seen this before at a job where the guy that wrote the XML to a field was long gone and everyone was left wondering how exactly to work with it. Sometimes elements were there, sometimes not, etc.
Similar to (2), in my opinion it breaks the purity of the database. In the same way that a lot of people would advise against storing HTML in a field, I would advise against storing raw XML.
But despite these three points ... it can work and TSQL definitely supports queries against it.
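For reference, a hedged T-SQL sketch of querying into such an XML column (table, column, and element names here are made up for illustration):

-- Pull a single value out of the stored XML for each row.
SELECT r.ResponseId,
       r.ResponseXml.value('(/Response/Status)[1]', 'nvarchar(50)') AS Status
FROM   dbo.WebServiceResponse AS r
WHERE  r.ResponseXml.exist('/Response[Status="OK"]') = 1;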
Are you reading the field more than you are writing it?
You want to do the conversion on whichever step you do least often or the step that doesn't involve the user.
I am doing a project that needs to store 30 distinct fields for a piece of business logic, which will later be used to generate a report for each one.
The 30 distinct fields are not written all at once; the business logic involves many transactions, along the lines of:
Transaction 1: update fields 1-4
Transaction 2: update fields 3, 5, 9
Transaction 3: update fields 8, 12, 20-30
...
...
N.B. each transaction (all belonging to the same piece of business logic) updates an arbitrary number of fields, in no particular order.
I am wondering which database design would be best:
1. Have 30 columns in the Postgres database representing those 30 distinct fields.
2. Store the 30 fields as XML or JSON in just one column in Postgres.
Which one is better, 1 or 2?
If I choose option 1:
I know that from a programming perspective it is easier, because I don't need to read the whole XML/JSON document, update a few fields, and write it back to the database; I can just update the few columns I need in each transaction.
If I choose option 2:
I can potentially reuse the table generically for something else, since what's inside the blob column is just XML. But is it wrong to use a generic table to store something totally unrelated in business terms just because it has a blob column storing XML? This could save the effort of creating a few new tables, but is this kind of generic table reuse wrong in an RDBMS?
Also, by choosing option 2 it seems I could handle potential changes, like changing certain fields or adding more fields. At least it seems I wouldn't need to change the database table. But I would still need to change the C++ and C# code to handle the change internally, so I'm not sure whether this is really an advantage.
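As a hedged sketch of how the two options differ at write time (table and field names are made up; option 2 here assumes the JSON variant and PostgreSQL 9.5+ for jsonb_set):

-- Option 1: each transaction touches only the columns it owns.
UPDATE business_record
SET    field1 = 'a', field3 = 'b', field9 = 'c'
WHERE  record_id = 42;

-- Option 2: the whole document sits in one column; even a one-field
-- change rewrites the stored value.
UPDATE business_record
SET    fields = jsonb_set(fields, '{field3}', '"b"'::jsonb)
WHERE  record_id = 42;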
I am not experienced enough in database design to make this decision on my own. Any input is appreciated.
N.B. there is a good chance I won't need to index or search on those 30 columns for now; a primary key would be created on an extra column if I choose option 2. But I am not sure whether I will later be required to search on any of those columns/fields.
Basically all my fields are predefined in the requirements documents; they are generally simple fields like:
field1: value(max len 10)
field2: value(max len 20)
...
field20: value (max len 2)
No nested fields. Is it worth creating 20 columns, one for each of those fields (some are strings like date/time, some are plain strings, some are integers, etc.)?
Regarding option 2: is putting different pieces of business logic in a shared table a bad design idea, if they are only put in a shared table because they share the same structure? E.g. they all have a datetime column, a primary key, and an XML column with different business logic inside. This way we save some effort creating new tables... Is this saving worth it?
Always store the fields from your XML/JSON as separate columns in a relational database. Doing so keeps your database normalized, lets the database do its thing with queries, indexes, etc., and saves other developers the headache of deciphering your XML/JSON field.
It will be more work up front to extract the fields from the XML/JSON, and perhaps to maintain it if fields need to be added, but once you create a class or classes to do so, that hurdle is eliminated, and it will more than make up for a cryptic blob field.
In general it's wise to split the JSON or XML document out and store it as individual columns. This gives you the ability to set up constraints on the columns for validation and checking, to index columns, to use appropriate data types for each field, and generally use the power of the database.
Mapping it to/from objects isn't generally too hard, as there are numerous tools for this. For example, Java offers JAXB and JPA.
The main time when splitting it out isn't such a great idea is when you don't know in advance what the fields of the JSON or XML document will be or how many of them there will be. In this case you really only have two choices - to use an EAV-like data model, or store the document directly as a database field.
In this case (and this case only) I would consider storing the document in the database directly. PostgreSQL's SQL/XML support means you can still create expression indexes on xpath expressions, and you can use triggers for some validation.
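For example, a sketch of an expression index on one xpath expression (PostgreSQL; table, column, and element names are illustrative):

CREATE INDEX form_data_field1_idx
    ON form_data ( ((xpath('/form/field1/text()', doc))[1]::text) );

-- Queries that use the same expression can then use the index.
SELECT *
FROM   form_data
WHERE  (xpath('/form/field1/text()', doc))[1]::text = 'some value';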
This isn't a good option, it's just that EAV is usually an even worse option.
If the document is "flat" (i.e. a single level of keys and values, with no nesting), then consider storing it as hstore instead, as the hstore data type is a lot more powerful for that case.
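A minimal hstore sketch for that flat case (assumes the hstore extension is installed; names are illustrative):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE flat_doc (
    id     serial PRIMARY KEY,
    fields hstore NOT NULL
);

INSERT INTO flat_doc (fields)
VALUES ('field1 => "abc", field2 => "2014-01-01"');

-- Pull out single keys and filter on key presence.
SELECT fields -> 'field1' AS field1
FROM   flat_doc
WHERE  fields ? 'field2';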
Option 1 is more standard, for good reasons. For one thing, it enables the database to do the heavy lifting on things like search and indexing.
I have a lot of XSD files with different complex types in them. I want to import data into my Oracle database, but the amount of data is so huge that I can't use xsd2db or Altova XMLSpy; it's blowing my mind. I'm looking for a simple, useful ETL tool that can help me with this. Does anyone know of a GUI tool to generate DDL from an XSD?
This is a follow-up to my comment; I am not positioning it as an answer, but it should help you understand better what it is you are after, and maybe what you can do about it. For sure, it should be a good example for @a_horse_with_no_name...
I am not familiar with XMLSpy, but what I saw in xsd2db made me think of .NET's ability to infer a DataSet from an XML Schema. While the authoring style of the XSD itself may affect the way a DataSet is derived, that is mostly insignificant for larger bodies of XSD. What's more, there is a big chance that the derivation might not even work (it has many limitations).
From my own experience, the derivation process in .NET gives you a very normalized structure. To illustrate, I am going to introduce a sample XML:
<ShippingManifest>
  <Date>2012-11-21</Date>
  <InvoiceNumber>123ABC</InvoiceNumber>
  <Customer>
    <FirstName>Sample</FirstName>
    <LastName>Customer</LastName>
  </Customer>
  <Address>
    <UnitNumber>2A</UnitNumber>
    <StreetNumber>123</StreetNumber>
    <StreetName>A Street</StreetName>
    <Municipality>Toronto</Municipality>
    <ProvinceCode>ON</ProvinceCode>
    <PostalCode>X9X 9X9</PostalCode>
  </Address>
  <PackingList>
    <LineItem>
      <ID>Box1</ID>
      <Load>1-233</Load>
      <Description>Package box</Description>
      <Items>22</Items>
      <Cartons>22</Cartons>
      <Weight>220</Weight>
      <Length>10</Length>
      <Width>10</Width>
      <Height>10</Height>
      <Volume>1000</Volume>
    </LineItem>
    <LineItem>
      <ID>Box2</ID>
      <Load>456-233</Load>
      <Description>Package box</Description>
      <Items>22</Items>
      <Cartons>22</Cartons>
      <Weight>220</Weight>
      <Length>10</Length>
      <Width>10</Width>
      <Height>10</Height>
      <Volume>1000</Volume>
    </LineItem>
  </PackingList>
</ShippingManifest>
Conceptually, its structure is very simple: a shipping manifest entity, a customer, a shipping address and a packing list.
Converting this to an ADO.NET DataSet is a straightforward exercise, with a very clean output.
It should be easy to imagine how the number of entities (tables in your database, if you wish) could mushroom with just slightly more complex XML...
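To make that concrete, here is a hedged SQL sketch of the kind of normalized layout a derivation tool might produce for the sample above (entity names follow the XML; keys and column types are my assumptions):

CREATE TABLE shipping_manifest (
    manifest_id    bigint PRIMARY KEY,
    manifest_date  date,
    invoice_number varchar(20)
);

CREATE TABLE customer (
    manifest_id bigint PRIMARY KEY REFERENCES shipping_manifest(manifest_id),
    first_name  varchar(100),
    last_name   varchar(100)
);

CREATE TABLE address (
    manifest_id   bigint PRIMARY KEY REFERENCES shipping_manifest(manifest_id),
    unit_number   varchar(10),
    street_number varchar(10),
    street_name   varchar(100),
    municipality  varchar(100),
    province_code char(2),
    postal_code   varchar(10)
);

-- A strict derivation would also add a packing_list entity between the
-- manifest and its line items; it is collapsed here for brevity.
CREATE TABLE line_item (
    manifest_id bigint NOT NULL REFERENCES shipping_manifest(manifest_id),
    id          varchar(20),
    load        varchar(20),
    description varchar(200),
    items       int,
    cartons     int,
    weight      numeric,
    length      numeric,
    width       numeric,
    height      numeric,
    volume      numeric,
    PRIMARY KEY (manifest_id, id)
);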
As a sidebar, if one designs the XSD with a DataSet-based process in mind, then removing the PackingList element and making the LineItem collection repeat directly under ShippingManifest gives a somewhat simplified layout: one without the PackingList entity.
Automatic tools to convert an XSD data model to a relational model, such as .NET's, are typically designed to generate a highly normalized structure. The denormalization, I guess, is left to the user for obvious reasons.
QTAssistant's XML Builder is different. Our requirement was to create an ER model that would work where .NET's XSD-to-DataSet doesn't, and with an output containing a smaller number of entities where possible.
For this same sample, what QTAssistant does is merge all entities engaged in a one-to-one relationship. From a modeling perspective it's an obvious sin, but it does have its benefits, particularly for our users, who are interested in a simple structure capable of capturing data (test data, to be more specific).
The generated mapping (XSD to ER) is bi-directional. It means that it can be used to generate valid XML from a database, or "shred" XML data into a database (the shredding is done by generating DML statements). The way this technology is used: test cases are stored in Excel spreadsheets, XML is generated, sent to a web service, and the results are stored back in Excel.
We also generate an XML file describing the structure, which can be converted to DDL through an XSLT. This is where things can get messy, depending on your schema. It is rather common to see XSDs where simple types are unconstrained: strings without a maximum length, patterns without a maximum length, unconstrained decimals, etc. These are just some of the reasons why, in our case, we don't have an out-of-the-box, straightforward way to generate the DDL, but rather provide hooks for customization.
So, to close my comment, I pretty much know what you want to do (I have to assume that other things, such as Oracle's XML capabilities, or XML databases and XQuery, etc. have been ruled out). Unfortunately, the XSD really matters here, so if you can share those as per my comment, I can take a look - it'll be up to you how much you want to share back here.
Consider Microsoft SQL Server 2008
I need to create a table that could be designed in two different ways, as follows.
Structure Columnwise
StudentId number, Name Varchar, Age number, Subject varchar
eg.(1,'Dharmesh',23,'Science')
(2,'David',21,'Maths')
Structure Rowwise
AttributeName varchar,AttributeValue varchar
eg.('StudentId','1'),('Name','Dharmesh'),('Age','23'),('Subject','Science')
('StudentId','2'),('Name','David'),('Age','21'),('Subject','Maths')
In the first case there will be fewer records; with the second approach there will be four times as many records, but two fewer columns.
So which approach is better in terms of performance, disk storage, and data retrieval?
Your second approach is commonly known as an EAV design - Entity-Attribute-Value.
IMHO, the first approach all the way. It allows you to type your columns properly, which gives the most efficient storage of data and greatly helps with the ease and efficiency of queries.
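For instance, a sketch of the column-wise table with proper types (names follow the question's example; the lengths are assumptions):

CREATE TABLE Student (
    StudentId int          NOT NULL PRIMARY KEY,
    Name      varchar(100) NOT NULL,
    Age       int          NOT NULL,
    Subject   varchar(50)  NOT NULL
);
-- In the row-wise (EAV) layout, every one of these values would sit in an
-- untyped varchar column, so none of the type checks above are possible.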
In my experience, the EAV approach usually causes a world of pain. Here's one example of a previous question about this, with good links to best practices. If you do a search, you'll find more - well worth a sift through.
A common reason why people head down the EAV route is to model a flexible schema, which is relatively difficult to do efficiently in an RDBMS. Other approaches include storing data in XML fields. This is one area where NoSQL (non-relational) databases can come in very handy due to their schema-less nature (e.g. MongoDB).
The first one will be better in terms of performance, disk storage, and data retrieval.
Having attribute names as varchars makes it impossible to enforce names or datatypes, or to apply any kind of validation
It will be impossible to index for the desired search operations
Saving integers as varchars will use more space
Ordering, adding, or summing integers will be a headache and will perform badly (see the sketch below)
The programming language using this database will have no way of working with strongly typed data
There are many more reasons for using the first approach.
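As a hedged illustration of the point about summing or averaging integers stored as varchars (the table names for the two layouts are made up):

-- Column-wise: straightforward and properly typed.
SELECT AVG(Age) FROM StudentColumnwise;

-- Row-wise (EAV): the same aggregate needs a filter plus a cast, and
-- nothing stops a non-numeric value from sneaking into AttributeValue.
SELECT AVG(CAST(AttributeValue AS int))
FROM   StudentRowwise
WHERE  AttributeName = 'Age';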