Cassandra's CQL doesn't have anything like MySQL's LIKE clause for searching for more specific data in the database.
I have looked through some material and came up with a couple of ideas:
1. Using Hadoop
2. Using a MySQL server as another database server
But is there any easier way to improve my Cassandra DB performance?
Improving your Cassandra DB performance can be done in many ways, but I feel what you really need is to query the data efficiently, which has nothing to do with performance tweaks on the DB itself.
As you know, Cassandra is a NoSQL database, which means that when dealing with it you are sacrificing query flexibility for fast reads/writes, scalability, and fault tolerance. That means querying the data is slightly harder. There are several patterns that can help you query the data:
Know what you need in advance. Since querying with CQL is less flexible than what you would find in an RDBMS engine, you can take advantage of the fast reads/writes and save the data you want to query in the proper format by duplicating it. Too complex?
Imagine you have a user entity that looks like that:
{
  "pk": "someTimeUUID",
  "name": "someName",
  "address": "address",
  "birthDate": "someBirthDate"
}
If you persist users like that, you will get a list of users sorted in the order they joined your DB (i.e. the order you persisted them). Let's assume you want to get the same list of users, but only those who are named "John". It is possible to do that with CQL, but it is somewhat inefficient. What you can do to amend this problem is to de-normalize your data, duplicating it so that it fits the query you are going to execute over it. You can read more about this here:
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
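To make that concrete, here is a minimal sketch (the table and column names are just assumptions for illustration, not taken from your model): you write every user into a second table partitioned by name, so that "all users named John, in join order" becomes one cheap query.

-- duplicated copy of the user data, keyed by the value you want to filter on
CREATE TABLE users_by_name (
    name text,
    pk timeuuid,
    address text,
    birth_date text,
    PRIMARY KEY (name, pk)
);

-- written alongside the main users table on every insert
INSERT INTO users_by_name (name, pk, address, birth_date)
VALUES ('John', now(), 'address', 'someBirthDate');

-- "all users named John", sorted by the time-based pk (join order)
SELECT * FROM users_by_name WHERE name = 'John';

The cost is the extra write and the duplicated storage, which is exactly the trade-off Cassandra encourages.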
However, this approach is fine for simple queries; for complex queries it is somewhat hard to achieve, and if you are unsure of what you are going to query in advance, there is no way to store the data in the proper manner beforehand.
Hadoop comes to the rescue. As you know, you can use Hadoop's MapReduce to solve tasks involving a large amount of data, and Cassandra data, in my experience, can become very, very large. With Hadoop, to solve the above example, you would iterate over the data as it is and, in each map method, check whether the user is named John; if so, write it to the context.
Here is how the pseudocode would look:
map(data) {
    if ("John".equals(data.getColumn("name"))) {
        context.write(data);
    }
}
At the end of the map phase, you would end up with a list of all users who are named John. You could put a time range (range slice) on the data you feed to Hadoop, which would give you all the users who joined your database over a certain period and are named John. As you see, here you are left with a lot more flexibility and you can do virtually anything. If the resulting data is small enough, you could put it in some RDBMS as summary data or cache it somewhere so further queries for the same data can retrieve it easily. You can read more about Hadoop here:
http://hadoop.apache.org/
Related
I am currently working with Java Spring and Postgres.
I have a query on a table; many filters can be applied to the query, and each filter needs many joins.
This query is very slow due to the number of joins that must be performed, and also because there are many rows in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At query time, search operations would be performed on that table, so I would do just one join.
But I have some doubts:
What is the best approach in Postgres to save an item list flat?
I know there is a JSON datatype; could I use this to hold the information needed for the search and query it with jsonpath? Is this performant with lists?
I would also greatly appreciate any advice on another approach that could fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, like a graph-based one. At this point the only problem I have is with this specific table; the rest of the workload is simple queries that adapt very well to relational databases.
Is there any scaling statistic, based on ratios and the number of items, that would indicate which kind of database to choose?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs space tradeoff to reduce the time of query, at the cost of duplicating the data and increasing write/insert time.
There are third-party tools that are specifically designed for this use case, including search tools (like Elasticsearch, Solr, etc.) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.
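To sketch the denormalization route from the question (all names here - item, info_search, the trigger - are invented for illustration, and the details depend on your schema): keep a flattened, GIN-indexed jsonb copy of the searchable attributes, maintained by a trigger, so a filtered search needs only one join back to the main table.

-- flattened copy of the searchable attributes of an item
CREATE TABLE info_search (
    item_id bigint PRIMARY KEY REFERENCES item(id),
    filters jsonb
);

CREATE INDEX info_search_filters_idx ON info_search USING gin (filters);

-- keep the flat copy in sync with the source table
CREATE OR REPLACE FUNCTION sync_info_search() RETURNS trigger AS $$
BEGIN
    INSERT INTO info_search (item_id, filters)
    VALUES (NEW.id, to_jsonb(NEW))
    ON CONFLICT (item_id) DO UPDATE SET filters = EXCLUDED.filters;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_search_sync
AFTER INSERT OR UPDATE ON item
FOR EACH ROW EXECUTE FUNCTION sync_info_search();

-- a filtered search becomes a single containment check plus one join
SELECT i.*
FROM item i
JOIN info_search s ON s.item_id = i.id
WHERE s.filters @> '{"color": "red", "tags": ["a"]}';

Whether jsonb plus a GIN index beats a plain denormalized column per filter depends on how selective the filters are; for very text-heavy searching, the external tools mentioned above (Elasticsearch, Solr) usually scale better.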
I know this is a 'soft' question, which is usually frowned upon on SO, but I have been using BigQuery to do data analysis on (obviously) flat data, which contains both structs and repeated data. Let's just use a very basic example, a row might look like this:
ID
Title (str)
ReleaseYear (int)
Genres (str[])
Credits (struct[])
And an example piece of data might look like:
{
  "ID": "T-1997",
  "Title": "Titanic",
  "ReleaseYear": 1997,
  "Genres": ["Drama", "Romance"],
  "Credits": {
    "Actors": ["Leonardo DiCaprio", "Kate Winslet"],
    "Directors": ["James Cameron"]
  }
}
My question is basically what type of operations or queries can be done in a native document store, such as MongoDB or CouchBase, that couldn't be done in a relational DB that supports arbitrarily-nested data. In other words, my assumption (and I hope I'm wrong or misguided) is that as long as a DB supports structs, it can do everything that a document-store can do. If not, what are some places where it is either: (1) something that can be done in MongoDB (or any other document-store) that cannot be done in BigQuery (or any other database that supports structs)? and (2) something that can be done much more easily in MongoDB than in a relational DB?
what type of operations or queries can be done in a native document
store, such as MongoDB or CouchBase, that couldn't be done in a
relational DB that supports arbitrarily-nested data.
Even if it does support arbitrarily nested data, BigQuery allows limited nesting compared to MongoDB; MongoDB supports more levels of nesting.
In BigQuery, your schema cannot contain more than 15 levels of nested STRUCTs. MongoDB supports up to 100 levels of nesting for BSON documents.
In other words, my assumption (and I hope I'm wrong or misguided) is
that as long as a DB supports structs, it can do everything that a
document-store can do.
Not exactly - nested columns are columns within columns. But sharding in an RDBMS is a complex endeavor compared to a NoSQL database like Mongo. Technically you can do it, but it wasn't designed for the same purpose. It's like using a wrench as a hammer - sure you can, but its purpose was something different. You should use the right tool for the right purpose.
If not, what are some places where it is either: (1) something that
can be done in MongoDB (or any other document-store) that cannot be
done in BigQuery (or any other database that supports structs)? and
(2) something that can be done much more easily in MongoDB that in a
relational DB?
The crux of the matter is, an RDBMS may tack on features to "technically" allow you to do some things that you can do in a NoSQL database. But it doesn't mean it may work just as well. For example, because of the features that make an RDBMS an RDBMS (ACID compliance, transactions etc), there will always be an additional performance hit compared to a NoSQL database. If an RDBMS removes these features, then it is no longer an RDBMS!
This answer illustrates how MongoDB achieves better performance because it doesn't need to support RDBMS features:
https://softwareengineering.stackexchange.com/questions/54373/when-would-someone-use-mongodb-or-similar-over-a-relational-dbms
- MongoDB has a lower latency per query & spends less CPU time per query because it is doing a lot less work (e.g. no joins, transactions).
- As a result, it can handle a higher load in terms of queries per second and is thus often used if you have a massive # of users.
- MongoDB is easier to shard (use in a cluster) because it doesn't have to worry about transactions and consistency.
- MongoDB has a faster write speed because it does not have to worry about transactions or rollbacks (and thus does not have to worry about locking).
- MongoDB does not have a schema, in case you have a special use case that can take advantage of that.
Another feature is sharding - sharding is easier with MongoDB because it doesn't need to support many of the features which make an RDBMS an RDBMS, such as being ACID compliant. In contrast, sharding is complex for an RDBMS because an RDBMS must remain ACID compliant.
Think of two vehicles: a speed boat and an "amphibious car".
The speed boat would outperform the amphibious car in the water 10/10 times. The amphibious car technically can navigate in water, but it wasn't designed to, hence it is much slower and unsuited for the purpose.
Likewise, consider the difference in aerodynamics between the speed boat and a sleek automobile. Even if you tacked wheels onto the boat, it's not going to perform as well as the car on land. (As an analogy, you could say that NoSQL databases don't do joins - you have to implement them yourself - but will that perform better than an RDBMS for join-heavy operations?)
The point I'm making with the analogies, is that each kind of database was initially designed for a specific goal, and over time features have been added to try and make it solve problems it was not designed for (hence it doesn't do it as well as something specifically designed for that purpose).
Hence in your question, even if BigQuery or some RDBMS can do something, it doesn't mean that you should use them for the job. The same applies for NoSQL databases. You should use the best tool for the job.
Disclaimer: I don't have experience with MongoDB or CouchBase. My answer is based on BigQuery's STRUCT capabilities.
Performance
BigQuery's STRUCT is optimized for querying. For example, if you run SELECT a.nested_b.nested_c.nested_d FROM table_t, the query only scans data for the leaf STRUCT field nested_d, so it is fast and cheap.
Usability
If your data is write-once or append-only, then STRUCT column is comparable with document store AFAIK.
But if you want to update only a certain nested field later, a nested STRUCT makes that pretty difficult, because there is no way to update a single item in a REPEATED field: you have to load the whole array, scan and change it, and repack it to update the column. You will be writing something like:
UPDATE table
SET Credits.Actors = (SELECT ARRAY_AGG(...) FROM UNNEST(Credits.Actors) WHERE ...)
WHERE ...
It becomes an even bigger problem when there is an array of structs of arrays (or even more nested levels). Based on my understanding of document stores, updating a single nested field of a document should be easier than this. Basically, this is the price you have to pay for the performance benefit mentioned earlier.
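As a concrete illustration using the Titanic row from the question (assuming here that Credits is a single STRUCT column and that the table is called movies - both assumptions for the sake of the example), correcting one actor's name means rebuilding the whole Actors array:

-- rebuild Credits.Actors just to change one element
UPDATE movies
SET Credits.Actors = (
  SELECT ARRAY_AGG(IF(actor = 'Leo DiCaprio', 'Leonardo DiCaprio', actor))
  FROM UNNEST(Credits.Actors) AS actor
)
WHERE ID = 'T-1997';

In a document store, the equivalent is typically a targeted update of a single array element rather than a rebuild of the whole array, which is the point being made above.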
I'm new to Cassandra and I would like to know something.
I want to store several types of data in Cassandra (boolean, text, double, and so on). I would like to know how I should store all this data in Cassandra: all the data-type tables in one keyspace, or one data-type table per keyspace?
For example, Some_Keyspace(boolean_table, text_table, ...) or Boolean_Keyspace(boolean_table), Text_Keyspace(text_table)?
Which is the better way to avoid overloading and not decrease the reading and writing speed?
Thank you
Take a look at the free courses on https://academy.datastax.com/courses
Start with the basics, then take a look at the course on data modelling which will explain how to structure your data.
1) You should model your tables to fit your access patterns.
2) Replication factor is configured per keyspace, which could impact how you break up tables into keyspaces.
Based on your question, you should do a lot more reading on access patterns and data modeling in Cassandra.
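As a hedged illustration of point 1 (the keyspace, table, and query here are invented, not taken from your use case): in Cassandra you typically design one table per query, with the partition key chosen from the columns you filter on, and keep related tables in one keyspace so they share a replication factor.

-- query to support: "all readings for a given sensor on a given day, newest first"
CREATE KEYSPACE IF NOT EXISTS some_keyspace
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE some_keyspace.readings_by_sensor_day (
    sensor_id text,
    day date,
    ts timestamp,
    bool_value boolean,
    text_value text,
    dbl_value double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

SELECT * FROM some_keyspace.readings_by_sensor_day
WHERE sensor_id = 's-42' AND day = '2020-01-01';

Splitting each data type into its own keyspace buys you nothing by itself; keyspaces mainly matter for replication settings, as point 2 says.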
I would like to know whether it is worth using a graph database to work specifically with relationships.
I intend to use a relational database for storing entities like "User", "Page", "Comment", "Post", etc.
But in most cases of a typical social-graph workload, I have to do deep traversals, which relational databases are not good at and which involve slow joins.
Example: Comment -(made_in)-> Post -(made_in)-> Page etc...
I'm thinking of doing something like this:
Example:
User id: 1
Query: Get all followers of user_id 1
Query Neo4j for all outgoing edges named "follows" for the user node with id 1
With the resulting list of ids, query them on the Users table:
SELECT *
FROM users
WHERE user_id IN (ids)
Is this slow?
I have seen this question Is it a good idea to use MySQL and Neo4j together?, but still cannot understand why the correct answer says that that is not a good idea.
Thanks
Using Neo4j is a great choice of technologies for an application like yours, that requires deep traversals. The reason it's a good choice is two-fold: one is that the Cypher language makes such queries very easy. The second is that deep traversals happen very quickly, because of the way the data is structured in the database.
In order to reap both of these benefits, you will want to have both the relationships and the people (as nodes) in the graph. Then you'll be able to do a friend-of-friends query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof
and a friend-of-friend-of-friend query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->()-[:friend]->fofof
RETURN john, fofof
...and so on. (Same idea for posts and comments, just replace the name.)
Using Neo4j alongside MySQL is fine, but I wouldn't do it in this particular way, because the code will be much more complex, and you'll lose too much time hopping between Neo4j and MySQL.
Best of luck!
Philip
In general, the more databases/systems/layers you've got, the more complex the overall setup and operating will be.
Think about all those tasks like synchronization, export/import, backup/archive etc. which become quite expensive if your database(s) grow in size.
People use polyglot persistence only if the benefits of having dedicated and specialized databases outweigh the drawbacks of having to cope with multiple data stores. For example, this can be the case if you have a large number of data items (activity or transaction logs, for example), each related to a user. It would probably make no sense to store all the information in a graph database if you're only interested in the connections between the data items. So you would be better off storing only the relations in the graph (with the nodes just holding a pointer into the other database), and the data per item in a K/V store or the like.
For your example use case, I would go only for one database, namely Neo4j, because it's a graph.
As the other answers indicate, using Neo4j as your single data store is preferable. However, in some cases there might not be much choice in the matter, where you already have another database behind your product. I would just like to add that if this is the case, running Neo4j as your secondary database does work (the product I work on operates in this mode). You do have to work extra hard at figuring out what functionality you expect out of Neo4j, what kind of data you need for it, how to keep the data in sync, and the consequences of not always having real-time results. Most of our use cases can work with near-real-time results, so we are fine, but it may not be the case for your product. Still, to me, using Neo4j in this mode is preferable to running without it.
We are able to produce a lot of graphy-great stuff as a result of it.
I'm reviewing my code and realize I spend a tremendous amount of time
taking rows from a database,
formatting as XML,
AJAX GET to browser, and then
converting back into a hashed javascript object as my local datastore.
On updates, I have to reverse the process (except using POST instead of GET).
Having just started looking at Redis, I'm thinking I can save a tremendous amount of time keeping the objects in a key-value store on the server and just using JSON to transfer directly to JS client. But my feeble mind can't anticipate what I'm giving up by leaving a SQL DB (i.e. I'm scared to give up the GROUP BY/HAVING queries)
For my data, I have:
many-many relationships, i.e. obj-tags, obj-groups, etc.
query objects by a combination of such, i.e. WHERE tag IN ('a', 'b','c') AND group in ('x','y')
self joins, i.e. ALL the tags for each object WHERE tag='a' (sql group_concat())
a lot of outer joins, i.e. OUTER JOIN rating ON o.id = rating.obj_id
and feeds, which seem to be a strong point in REDIS
How do you successfully mix key-value & SQL DBs?
For example, is it practical to join a large list of obj ids from a Redis set with SQL data using a SQL range query (i.e. WHERE obj.id IN (1,4,6,7,8,34,876,9879,567,345, ...)), or vice versa?
Ideas/suggestions welcome.
You may want to take a look at MongoDB. It works with JSON-style objects and comes with SQL-like indexing & querying. Redis is more suitable for storing data structures like lists & sets, when you want a simple lookup instead of a complex query.
Now that the actual problem is more defined (i.e. you spend a lot of time writing repetitive conversion code to move from one layer/representation to the next), maybe you could consider writing (or googling for) something that automates this?
Google returns plenty of results for "convert table to XML" (and the reverse); would this help? Would something going directly from table to key/value pairs be better? Have you tried tackling this problem in a generalized way?
When you say "I spend a tremendous amount of time" do you mean this is a lot of development time, or are you referring to computing time?
Personally I'd be wary of mixing a RDBMS with a non-RDBMS solution, because this will probably create problems when the two different paradigms clash.