I'm new to Cassandra and I would like to know something.
I want to store several types of data in Cassandra (boolean, text, double and so on). Should I put all of these tables in one keyspace, or give each data type its own keyspace?
For example, Some_Keyspace(boolean_table, text_table, ...) or Boolean_Keyspace(boolean_table), Text_Keyspace(text_table)?
Which way is better for avoiding overload without decreasing read and write speed?
Thank you
Take a look at the free courses on https://academy.datastax.com/courses
Start with the basics, then take a look at the course on data modelling which will explain how to structure your data.
1) You should model your tables to fit your access patterns.
2) Replication factor is configured per keyspace, which could impact how you break tables up into keyspaces (see the sketch below).
Based on your question, you should do a lot more reading around access patterns and data modeling in Cassandra.
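To make points 1) and 2) concrete, here is a minimal sketch, assuming the DataStax Java driver (3.x-style Cluster/Session API) and made-up keyspace and table names: all tables live in one keyspace, the replication factor is set once on that keyspace, and each table is shaped around how it will be read.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class KeyspaceSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Replication is a property of the keyspace, not of individual tables.
            session.execute("CREATE KEYSPACE IF NOT EXISTS some_keyspace "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
            // Tables holding different value types sit side by side in the same keyspace;
            // what matters is that each primary key matches an access pattern.
            session.execute("CREATE TABLE IF NOT EXISTS some_keyspace.flags_by_item ("
                    + "item_id uuid, ts timestamp, flag boolean, "
                    + "PRIMARY KEY (item_id, ts))");
            session.execute("CREATE TABLE IF NOT EXISTS some_keyspace.readings_by_item ("
                    + "item_id uuid, ts timestamp, reading double, note text, "
                    + "PRIMARY KEY (item_id, ts))");
        }
    }
}

Splitting by keyspace only starts to matter when tables genuinely need different replication settings; otherwise one keyspace with access-pattern-driven tables is the usual layout.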
In a SQL database, we generally have related information stored in different tables in a database. From what I read of the RocksDB documentation, there's really no clear or 'right' way to represent this kind of structure. So I'm wondering what the usual practice is for organizing information.
Say I have three types of information: Customer, Product, and Employee. And I want to implement these in RocksDB. Should I use a key prefix, different column families, or different databases?
Thanks for any suggestion.
You can do it by coming up with some prefix scheme that encodes the table, the column, and the id. For simplicity you could store everything in one column family, and definitely in one db, since you then get atomic operations, snapshots and so on. The better question is why you would want to store relational data in a NoSQL db unless you are building something higher level.
By the way, check out linqdb, which is an example of a higher-level db where you can store entities and perform linq-style operations; it uses rocksdb underneath.
The way data is organized in a key-value store is up to the implementation. There is no "one good way to go"; it depends on the underlying key-value store's features (in particular, whether it is key-ordered).
The same normalization/denormalization techniques apply.
I think the piece you are missing about key-value store application design is the concept of key composition. Briefly, it is the practice of building keys in such a way that they are amenable to querying. If the database is ordered, then it will also allow prefix/range/scan queries and next/previous navigation. This will lead you to build key prefixes in such a way that querying is fast, i.e. doesn't require a full table scan.
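A minimal sketch of that key composition, assuming the RocksDB Java bindings (org.rocksdb) and made-up entity names: the "table" lives in the key prefix, and because keys are stored in sorted order, a prefix seek behaves like a scan over that table.

import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

public class KeyCompositionSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/shop-db")) {
            // The "table" is encoded in the key prefix.
            db.put(key("customer", "0001"), value("{\"name\":\"Alice\"}"));
            db.put(key("customer", "0002"), value("{\"name\":\"Bob\"}"));
            db.put(key("product",  "0001"), value("{\"title\":\"Lamp\"}"));

            // Range-scan all customers by seeking to the prefix and stopping once we leave it.
            try (RocksIterator it = db.newIterator()) {
                for (it.seek(key("customer", "")); it.isValid(); it.next()) {
                    String k = new String(it.key(), StandardCharsets.UTF_8);
                    if (!k.startsWith("customer:")) break;   // left the "customer" prefix
                    System.out.println(k + " -> " + new String(it.value(), StandardCharsets.UTF_8));
                }
            }
        }
    }

    private static byte[] key(String table, String id) {
        return (table + ":" + id).getBytes(StandardCharsets.UTF_8);
    }

    private static byte[] value(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }
}

Column families would work just as well here; the prefix approach simply keeps everything in one ordered keyspace so cross-entity snapshots and atomic writes stay trivial.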
Expand your research to other key value stores like bsddb or leveldb here on Stack Overflow.
Usually I use this data structure to store information in my DWH Fact Tables
Every month we get requests to add more metrics and dimensions to this fact table, so as the number of entries increases this becomes more complex.
I was wondering about migrating these structures to another kind of data structure, something like this:
So I do not have to change my table structures and can delegate the metric calculations to the business layer.
Does anyone know the canonical name of these structures? Then I can start looking for documentation.
Plus, if someone knows the disadvantages of using this...
Cassandra's CQL doesn't have anything like the LIKE clause in MySQL for searching for more specific data in the database.
I have looked through some material and came up with some ideas:
1. Using Hadoop
2. Using a MySQL server as another database server
But are there any easier ways I can improve my Cassandra DB performance?
Improving your Cassandra DB performance can be done in many ways, but I feel like you need to query the data efficiently, which has nothing to do with performance tweaks on the db itself.
As you know, Cassandra is a nosql database, which means when dealing with it, you are sacrificing flexibility of queries for fast read/writes and scalability and fault tolerance. That means querying the data is slightly harder. There are many patterns which can help you query the data:
Know what you need in advance. As querying with CQL is slightly less flexible than what you would find in an RDBMS engine, you can take advantage of the fast reads/writes and save the data you want to query in the proper format by duplicating it. Too complex?
Imagine you have a user entity that looks like that:
{
"pk" : "someTimeUUID",
"name": "someName",
"address": "address",
"birthDate": "someBirthDate"
}
If you persist users like that, you will get a sorted list of users in the order they joined your db (i.e. were persisted). Let's assume you want the same list of users, but only those named "John". It is possible to do that with CQL but slightly inefficient. What you could do to address this is denormalize your data by duplicating it so that it fits the query you are going to execute over it. You can read more about this here:
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
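As a sketch of that duplication (again assuming the DataStax Java driver; the keyspace, the original users table, and the second table users_by_name are all hypothetical names): every user is written twice, and "users named John" becomes a read of a single partition.

import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class DenormalizeByName {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("some_keyspace")) {
            // A second copy of the user data, partitioned by name instead of by id.
            session.execute("CREATE TABLE IF NOT EXISTS users_by_name ("
                    + "name text, pk timeuuid, address text, birth_date text, "
                    + "PRIMARY KEY (name, pk))");

            UUID id = UUIDs.timeBased();
            // Write the same user to both tables (the original users table is assumed to exist).
            session.execute("INSERT INTO users (pk, name, address, birth_date) VALUES (?, ?, ?, ?)",
                    id, "John", "some address", "someBirthDate");
            session.execute("INSERT INTO users_by_name (name, pk, address, birth_date) VALUES (?, ?, ?, ?)",
                    "John", id, "some address", "someBirthDate");

            // "All users named John", ordered by their time-based id within the partition.
            System.out.println(
                    session.execute("SELECT * FROM users_by_name WHERE name = ?", "John").all());
        }
    }
}

The cost is the extra write and the duplicated storage; the benefit is that the read stays a single-partition lookup no matter how many users you have.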
However, this approach works fine for simple queries; for complex queries it is somewhat hard to achieve, and if you are unsure what you are going to query in advance, there is no way to store the data in the proper manner beforehand.
Hadoop comes to the rescue. As you know, you can use Hadoop's MapReduce to solve tasks involving a large amount of data, and Cassandra data, in my experience, can become very, very large. With Hadoop, to solve the above example, you would iterate over the data as it is and, in each map call, check whether the user is named John; if so, write it to the context.
Here is how the pseudocode would look:
map(data) {
    if ("John".equals(data.getColumn("name"))) {   // keep only users named "John"
        context.write(data);
    }
}
At the end of the map phase, you would end up with a list of all users named John. You could put a time range (range slice) on the data you feed to Hadoop, which would give you all the users who joined your database over a certain period and are named John. As you can see, you are left with a lot more flexibility and can do virtually anything. If the resulting data is small enough, you could put it in some RDBMS as summary data or cache it somewhere so that further queries for the same data can retrieve it easily. You can read more about Hadoop here:
http://hadoop.apache.org/
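To flesh out the pseudocode above, here is a minimal sketch of a Hadoop Mapper applying the same filter. It assumes, purely for illustration, that the user records arrive as JSON text lines rather than through a Cassandra input format, so the input plumbing is deliberately simplified.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only those records whose "name" field is "John".
public class JohnFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        // Crude string check for the sketch; a real job would parse the record properly.
        if (record.contains("\"name\": \"John\"") || record.contains("\"name\":\"John\"")) {
            context.write(value, NullWritable.get());
        }
    }
}

The reducer (or no reducer at all) then collects the matching records, and the same job can be constrained to a time range by limiting the input it is fed.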
Consider Microsoft SQL Server 2008
I need to create a table which could be designed in two different ways, as follows.
Structure Columnwise
StudentId number, Name Varchar, Age number, Subject varchar
eg.(1,'Dharmesh',23,'Science')
(2,'David',21,'Maths')
Structure Rowwise
AttributeName varchar,AttributeValue varchar
eg.('StudentId','1'),('Name','Dharmesh'),('Age','23'),('Subject','Science')
('StudentId','2'),('Name','David'),('Age','21'),('Subject','Maths')
In the first case there will be fewer records; in the second approach there will be four times as many rows, but two fewer columns.
So which approach is better in terms of performance, disk storage and data retrieval?
Your second approach is commonly known as an EAV design - Entity-Attribute-Value.
IMHO, 1st approach all the way. That allows you to type your columns properly allowing for most efficient storage of data and greatly helps with ease and efficiency of queries.
In my experience, the EAV approach usually causes a world of pain. Here's one example of a previous question about this, with good links to best practices. If you do a search, you'll find more - well worth a sift through.
A common reason why people head down the EAV route is to model a flexible schema, which is relatively difficult to do efficiently in RDBMS. Other approaches include storing data in XML fields. This is one reason where NOSQL (non-relational) databases can come in very handy due to their schemaless nature (e.g. MongoDB).
The first one will be better in terms of performance, disk storage and data retrieval.
Having attribute names as varchars will make it impossible to change names, datatypes or apply any kind of validation
It will be impossible to index desired search actions
Saving integers as varchars will use more space
Ordering, adding or summing integers will be a headache and will perform badly (see the sketch after this list)
The programming language using this database will have no way of working with strongly typed data
There are many more reasons for using the first approach.
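One way to see the ordering/summing point is to put the same question to both layouts. A sketch, using the hypothetical table names Student and StudentAttributes, and assuming the EAV table is given an extra EntityId column (without one, the two-column version in the question cannot even tie a student's rows back together):

public class EavVsColumnwise {
    // Column-wise layout: typed columns, a straightforward aggregate.
    static final String COLUMN_WISE =
            "SELECT AVG(Age) FROM Student WHERE Subject = 'Science'";

    // EAV layout: each attribute needs its own self-join, and Age has to be
    // cast back from varchar before it can be averaged.
    static final String EAV =
            "SELECT AVG(CAST(age.AttributeValue AS int)) "
          + "FROM StudentAttributes AS age "
          + "JOIN StudentAttributes AS subj ON subj.EntityId = age.EntityId "
          + "WHERE age.AttributeName = 'Age' "
          + "AND subj.AttributeName = 'Subject' "
          + "AND subj.AttributeValue = 'Science'";

    public static void main(String[] args) {
        System.out.println(COLUMN_WISE);
        System.out.println(EAV);
    }
}

Every additional attribute in the WHERE clause adds another self-join to the EAV version, which is exactly where the performance and maintainability pain comes from.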
I have implicitly made this a community wiki seeing that the answers can be quite broad.
I'm working with a start-up company to accomplish the following goal.
In medical research, a patient's medical record can hold an unbounded amount of data relevant to a specific diagnosis, e.g. a smoker has a higher chance of getting lung cancer, but that doesn't necessarily mean a non-smoker can't get lung cancer. My goal is to create/use a database model that can deal with such parameters.
Now, I also have to come up with ways to data-mine this parametrized data to produce statistics, e.g. to see the trends across all 40-year-old females who suffered from lung cancer. The report can be generic (graph, tabular, etc.), so that doctors can see trends or analyse possible solutions that might work...
My questions are:
1) Which database systems allow for parametrized backend storage (e.g. Cassandra), can easily be used from Java, and are very efficient at data retrieval, linkage, etc.? We are dealing with a high volume of patient records per state.
2) What algorithms or AI techniques can I use for data mining? Are there any mining techniques out there that can help me do this?
PS How does Google Analytics deal with parametrised data?
PPS: Parametrized data is data which has a key and data, where the data can be a value, another key-value pair, a list of values, or a set of parametrized data (organized or unorganized).
I'm looking forward to your suggestions! :-D
I'll try to answer your first question only.
Cassandra is a key-value datastore (in your case, parametrized). If you use Cassandra, you need more computation time to derive complex reports, the reason being that it stores data in a raw format. NoSQL databases like Cassandra are good if you want to scale very, very big. They are eventually consistent and make trade-offs around data replication and latency.
In your case, as a patient can have data in virtually any form, try the model of a triple store (Semantic Web frameworks like Jena, OpenSesame, etc.). They allow you to have loose data structures that can be molded at runtime. Also, their query engines (SPARQL, SeRQL) give you more power than NoSQL stores (like Cassandra), although these querying capabilities are still less than an RDBMS.
For this question, this is how we have implemented this.
We created a keyspace called medical and a supercolumn family called patient.
Under the supercolumn family, we have a general supercolumn which basically stores the patient details, and another supercolumn called operation to keep records of the patient's operations.
Don't forget that the general supercolumn keeps a record of the patient as he/she comes to the doctor. That way, we know the patient's exact condition before, during and after an operation.
I know some data can be duplicated, but no supercolumns can be identical, as there is no way you can have two different patients with identical attributes and sickness.
So basically, Cassandra allows three layers of abstraction: keyspace, column/supercolumn family, and column/supercolumn.
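Supercolumn families are a legacy Thrift-era construct; as a rough sketch of the same three-level layout expressed in CQL terms instead (keyspace, table, clustering columns), assuming the DataStax Java driver and illustrative column names:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class MedicalModelSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS medical "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
            // One row per (patient, record type, timestamp): 'general' rows hold the
            // running patient details, 'operation' rows hold the operation records.
            session.execute("CREATE TABLE IF NOT EXISTS medical.patient ("
                    + "patient_id uuid, record_type text, recorded_at timestamp, "
                    + "details map<text, text>, "
                    + "PRIMARY KEY (patient_id, record_type, recorded_at))");
        }
    }
}

The clustering columns play the role of the supercolumn names, so all of a patient's general and operation records still live together under one partition.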
Hope this can help somebody.