Could someone explain why I get the following error message:
Mining structure column MyColumn has content type of Ordered that is not
supported by Microsoft Association or Microsoft Naive Bayes algorithms.
The documentation (Content Types (Data Mining)) states:
This content type is supported by all the data mining data types in Analysis Services. However, most algorithms treat ordered values as discrete values and do not perform special processing.
And specifically for Naive Bayes (Microsoft Naive Bayes Algorithm Technical Reference):
Input attribute: Cyclical, Discrete, Discretized, Key, Table, and Ordered
And another question: which algorithms does the Ordered content type actually affect? That is, what changes if we use Ordered instead of plain Discrete?
The ORDERED content type has no impact on any of the SSAS algorithms. It was added to the OLE DB for Data Mining specification in 1999 based on feedback from other data mining vendors. It is possible to write a custom algorithm that considers the ORDERED flag, but there is no practical way in SQL Server 2005 and beyond to actually order the discrete values (there was a way in SQL Server 2000, but it's not practical for 2005+ and likely doesn't work).
As an aside, the only time ordered discrete states are considered by any SSAS algorithm is when the Clustering algorithm handles discretized values.
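Since ORDERED changes nothing for these two algorithms anyway, the practical fix for the error above should simply be to declare the column as DISCRETE instead. A minimal DMX sketch (the structure name and key column are invented; only MyColumn comes from the question):

CREATE MINING STRUCTURE MyStructure (
    CaseKey  LONG KEY,
    MyColumn TEXT DISCRETE   -- was TEXT ORDERED
)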
I see that Oracle, DB2 and SQL Server all have a new XML column type. I'm developing on DB2, and from a database-design standpoint it seems you can break 1NF if the XML contains a list.
Am I wrong to assume that SQLXML can break 1NF?
Thank you,
The relational model is orthogonal to types and places no particular limitations on type complexity. A type could be arbitrarily complex, perhaps containing documents, images, video, etc., as long as all relational operations are supported for relations containing that type. First Normal Form is really just the definition of what a relation schema is, so in principle XML types are permissible under 1NF.
Oracle, DB2 and Microsoft SQL Server are not truly relational, however, and don't always represent relations and relational operations faithfully. For example, SQL Server doesn't support comparison between XML values, which means operations like σ(x=x)R or even π(x)R are not possible if x is an XML column. I haven't tried the same with DB2 and Oracle. It is moot whether such tables can properly be said to satisfy 1NF, since the XML is implemented as "special" data that doesn't behave as we expect data to behave in relations. Given such limitations, I think the important question is whether the proprietary XML type in your chosen DBMS is actually fit for your purposes at all.
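To illustrate the SQL Server case (a sketch; the exact error text may vary by version):

CREATE TABLE T (id INT PRIMARY KEY, x XML);

-- Rejected by SQL Server: XML values cannot be compared
SELECT * FROM T WHERE x = x;
-- Msg 305: The XML data type cannot be compared or sorted,
-- except when using the IS NULL operator.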
The SQL standard defines the XML data type, its semantics and the functions around it in its part 14 ("SQL/XML"). You could "legally" store a few bytes in an XML column or stuff an entire database into a single XML value. It is up to the user, and yes, it breaks classic database design. However, if the rest of the database is in 1NF and the XML-typed column is used only for some special payloads (app data, configurations, legal docs, digital signatures, ...), they make a great combination.
There are already other data types and SQL features that allow to break 1NF. Same as above, it is up to the user.
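As a sketch of such a combination (all names invented): a table that is otherwise in 1NF, with a single XML column reserved for an opaque payload:

CREATE TABLE contract (
    contract_id INTEGER NOT NULL PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    signed_on   DATE,
    signed_doc  XML   -- opaque legal document; never used in relational logic
);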
For our project, we need a database that supports JOINs and makes it easy to add and modify attributes of an entity (schema-less/schema-free). Key points:
The system is designed to work with customers (CRM)
Basic entities: User, Customer, Case, Case Interaction, Order
Currently in the database there are ~200k customers and ~250k orders
The Customer entity contains 15-20 optional attributes that are usually not filled in
About 100 new cases a day
The data is synchronized with several other sources in the background
Requirements (high to low priority):
Ability to implement search/sort by related entities, e.g. Case by linked Customer name (support JOINs)
Flexibility to change the schema of the data, without storing NULLs for a large number of attributes
Performance
An ORM for Python with support for tracking changes and writing only the changes to the database
What we've tried:
MongoDB does not satisfy point 1.
PostgreSQL with all the attributes in one table does not satisfy point 2.
PostgreSQL with a separate table for each attribute, or EAV, does not satisfy point 3 (a lot of slow joins; see the sketch below), but seems a better solution than the others.
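To make point 3 concrete: reconstructing a single customer from per-attribute tables takes roughly one join per attribute. A sketch with hypothetical names:

SELECT c.id, n.value AS name, p.value AS phone, r.value AS region
FROM customer c
LEFT JOIN attr_name   n ON n.customer_id = c.id
LEFT JOIN attr_phone  p ON p.customer_id = c.id
LEFT JOIN attr_region r ON r.customer_id = c.id;
-- ...and so on, for each of the 15-20 optional attributes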
Can you suggest any database or design of the system that will meet our needs?
Datomic might be worth checking out (http://www.datomic.com/). It satisfies requirements 1-3, and although there's no python ORM, there is a REST API.
Datomic is based on an Entity Attribute Value schema (it's not quite schema-free - you need to specify a name and type for each attribute - but any entity can have any attribute). It is transactional and has support for joins, unlike some of the other flexible "NoSQL" solutions. Interestingly, it also has first-class support for time (e.g. what is the history of this entity, what did the database look like at time t, etc.), which might be useful if you're tracking cases and interactions.
Queries are based on datalog, which queries by unification. Query by unification looks a bit odd at first but is brilliant once you get used to it.
For example, a query to find cases by linked customer name would be something like this:
[:find ?x
 :in $
 :where [?x :case/linked-customers ?c]
        [?c :customer/name "Barry"]]
The query engine looks in the database, and tries to satisfy the where clause by unifying all occurrences of a given variable. In this case, only ?c appears twice (the case has a linked customer c whose name is Barry), but queries can obviously get a lot more complex. The $ here represents the database.
You may want to consider storing the "flexible" part as XML. Some databases, e.g. DB2, allow XML indexing so lookup performance should be as good as with the relational data store. DB2 Express-C is free and does not have an artificial limit on the database size.
Update: since 2015, DB2 Express-C limits the database user data volume to 15 TB, which should still be plenty.
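In DB2, for example, an index can be created over a path inside the XML documents so that lookups on that path use the index. A sketch (table and path names are invented):

CREATE TABLE customer (
    id    INTEGER NOT NULL PRIMARY KEY,
    attrs XML
);

-- Index one attribute inside the document
CREATE INDEX ix_customer_name ON customer (attrs)
    GENERATE KEY USING XMLPATTERN '/customer/name'
    AS SQL VARCHAR(100);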
Consider Microsoft SQL Server 2008
I need to create a table which can be designed in two different ways, as follows.
Structure Columnwise
StudentId number, Name varchar, Age number, Subject varchar
e.g. (1,'Dharmesh',23,'Science')
(2,'David',21,'Maths')
Structure Rowwise
AttributeName varchar, AttributeValue varchar
e.g. ('StudentId','1'),('Name','Dharmesh'),('Age','23'),('Subject','Science')
('StudentId','2'),('Name','David'),('Age','21'),('Subject','Maths')
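In DDL terms, the two options would look roughly like this (the types are a sketch):

-- Columnwise
CREATE TABLE Student (
    StudentId INT,
    Name      VARCHAR(100),
    Age       INT,
    Subject   VARCHAR(100)
);

-- Rowwise (EAV)
CREATE TABLE StudentAttributes (
    AttributeName  VARCHAR(100),
    AttributeValue VARCHAR(100)
);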
In the first case there will be fewer records; in the second approach there will be about four times as many rows, but two fewer columns.
So which approach is better in terms of performance, disk storage and data retrieval?
Your second approach is commonly known as an EAV design - Entity-Attribute-Value.
IMHO, the 1st approach all the way. It allows you to type your columns properly, allowing for the most efficient storage of data, and it greatly helps with the ease and efficiency of queries.
In my experience, the EAV approach usually causes a world of pain. Here's one example of a previous question about this, with good links to best practices. If you do a search, you'll find more - well worth a sift through.
A common reason why people head down the EAV route is to model a flexible schema, which is relatively difficult to do efficiently in an RDBMS. Other approaches include storing data in XML fields. This is one area where NoSQL (non-relational) databases can come in very handy due to their schemaless nature (e.g. MongoDB).
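For a taste of the pain: just reading back one student from the EAV table means pivoting, and the rowwise layout as shown has no key tying a student's attribute rows together, so one would have to be added first. A sketch (EntityId is that assumed addition):

SELECT EntityId,
    MAX(CASE WHEN AttributeName = 'Name'    THEN AttributeValue END) AS Name,
    MAX(CASE WHEN AttributeName = 'Age'     THEN AttributeValue END) AS Age,
    MAX(CASE WHEN AttributeName = 'Subject' THEN AttributeValue END) AS Subject
FROM StudentAttributes
GROUP BY EntityId;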
The first one will be better in terms of performance, disk storage and data retrieval.
Having attribute names as varchars will make it impossible to change names or datatypes, or to apply any kind of validation
It will be impossible to index the desired search actions
Saving integers as varchars will use more space
Ordering, adding or summing integers will be a headache and will perform badly
The programming language using this database will have no way of working with strongly typed data
There are many more reasons for using the first approach.
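To make the typing and aggregation points concrete, a sketch (again assuming an EntityId column has been added to the rowwise table):

-- EAV: every value must be cast from varchar, and one bad row breaks the query
SELECT AVG(CAST(AttributeValue AS INT))
FROM StudentAttributes
WHERE AttributeName = 'Age';

-- Columnwise: trivially typed and indexable
SELECT AVG(Age) FROM Student;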
I am in the architecture stage of an academic project involving billions of records. The project should be very lightweight in terms of computing power and highly scalable.
The information structure is very simple: I need to store a list of items, each one with different features. The features are integers, decimals, dates, strings, etc. When the data is imported, the types of the features are known. Also, features can be used to reference other items.
I need to be able to get and sort a list of items by their features (more than one), possibly using operators such as >, <, = and regexes, plus string functions such as length, left, right and mid, both between feature values and against arbitrary user input.
Reporting in the sense of sums, averages and grouping is also necessary, but the demands there are more relaxed: there is no need for full cube capabilities, though more is better.
I am very new to the whole NoSQL world. What would you recommend?
If you check out the tutorials for MongoDB, they have, in my opinion, the best introduction to the Map/Reduce system that is used to query and aggregate.
I do wonder, though, why you have concluded in advance that NoSQL is the route to go. Although different items may have different schemas, is there a fixed number of entities and attributes? And why have you (if you have) ruled out SQL, which, after all, has decades of accumulated features for storing and querying data?
If you are going to use aggregates then you could use map reduce to populate aggregate tables and then serve that data.
Writing map-reduce for every query may be cumbersome, so you can also have a look at Apache Pig and Hive. They are especially helpful for the kind of ad-hoc queries you are talking about.
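For example, an ad-hoc aggregate that would take a fair amount of hand-written map-reduce code is a single statement in Hive, which compiles it to map-reduce jobs behind the scenes (a sketch over a hypothetical items table):

SELECT feature_name, AVG(feature_value) AS avg_value
FROM items
GROUP BY feature_name;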
http://en.wikipedia.org/wiki/Online_Analytical_Processing
How are the two related? How can we know that we are dealing with this type of program?
The two are often conflated but are not exactly equivalent.
A multi-dimensional database - i.e. a star schema:
http://en.wikipedia.org/wiki/Star_schema
(or arguably also a snowflake schema) is a way of organising data into fact tables and dimension tables: the former typically hold the numeric data (i.e. measurements) while the latter hold descriptive data. A star schema may be implemented using relational database technology or using specialised storage formats that have been optimised for manipulating dimensional data.
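For illustration, a minimal star schema in plain SQL (all names invented for this sketch):

-- Dimension table: descriptive attributes
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         VARCHAR(100),
    region       VARCHAR(50)
);

-- Fact table: numeric measurements, keyed by its dimensions
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer,
    date_key     INTEGER,        -- would reference a dim_date dimension
    amount       DECIMAL(10,2)
);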
OLAP is normally implemented using such specialised dimensional storage formats, and adds precalculation of summarised values.
Both are normally used as part of data warehousing. OLAP is likely to be adopted where the performance of a non-aggregated SQL database is judged inadequate for aggregated reporting requirements.
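Concretely, the summaries OLAP precalculates have the shape of the query below, so that reports don't re-aggregate the fact table on every request (reusing the sketch tables above):

SELECT d.region, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_customer d ON d.customer_key = f.customer_key
GROUP BY d.region;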
What multidimensional usually means in the context of OLAP systems is actually a database design based on "Dimensional Modelling" or software that supports dimensionally modelled data.
The word "multidimensional" used in that sense is not really very informative, because any relational database is inherently multidimensional (a relation being fundamentally an N-dimensional data structure, with the number of dimensions limited only by the constraints of software and hardware). Personally, therefore, I would prefer to avoid the term multidimensional altogether. It is just too ambiguous to be useful.