In Cassandra, how do I access historical data?

Cassandra uses a timestamp system to serve up the most recent records. How do I display a list of all values & timestamps for a particular column?
For example, I run this command for a Column family called 'Users':
set Users[jsmith][first]='John'
When I get the 'first' column, I see the following:
get Users[jsmith][first]
=> (column=first value=John, timestamp=1287604215498000)
Then, I update the 'first' column to Charlie.
set Users[jsmith][first]='Charlie'
I will now see the following
get Users[jsmith][first]
=> (column=first value=Charlie, timestamp=1299980101189000)
My question is how do I get all values (over time) for this column? I want to see something like get Users[jsmith][first] ==> John (timestamp), Charlie (timestamp).

You don't. Cassandra departs from the BigTable model here: only the most recent version is retained.
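To make the "only the most recent version" behaviour concrete, here is a toy Python model of a single column. It is only an illustration of last-write-wins semantics, not Cassandra's implementation; the class name LastWriteWinsColumn and its methods are made up for this sketch.

```python
class LastWriteWinsColumn:
    """Toy model of one column: keeps only the newest (value, timestamp) pair."""

    def __init__(self):
        self.value = None
        self.timestamp = -1

    def set(self, value, timestamp):
        # A write replaces the stored value only if it is at least as new.
        if timestamp >= self.timestamp:
            self.value = value
            self.timestamp = timestamp

    def get(self):
        return self.value, self.timestamp


first = LastWriteWinsColumn()
first.set("John", 1287604215498000)
first.set("Charlie", 1299980101189000)

# The earlier value is gone; only the latest write can be read back.
latest = first.get()
print(latest)
```

If the history matters, the application would have to keep it itself, for example by writing each version under its own column name rather than overwriting a single column. That is a data-modelling assumption on my part, not something the answer above prescribes.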

Related

Django Query: Annotate with `count` of a window

I am looking for a query very similar to this one, with one extension: I do not want to count all objects, only the fairly recent ones.
In my case there are two models, Source and Data. As a result I would like a list of all Sources ordered by the number of Data records collected during the last week.
I am not interested in how many data records have been collected in total, only in whether the source has been active recently.
Using the following code snippet from the above link, I cannot work out how to restrict the Data table first.
from django.db.models import Count
activity_per_source = Source.objects.annotate(count_data_records=Count('Data')) \
.order_by('-count_data_records')
The only approaches I came up with are writing native SQL or running individual queries in a loop. Is there a way to do this with a Django query?
(I use a MySQL database and Django 1.5.4)

Check out the docs on the order of annotate and filter: https://docs.djangoproject.com/en/1.5/topics/db/aggregation/#order-of-annotate-and-filter-clauses
Try something along the lines of:
import datetime
from django.db.models import Count
one_week_ago = datetime.date.today() - datetime.timedelta(days=7)
activity_per_source = Source.objects.\
    filter(data__date__gte=one_week_ago).\
    annotate(count_data_records=Count('data')).\
    order_by('-count_data_records').distinct()
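To show why the filter has to come before the annotate, here is a rough sketch of the SQL shape such a query corresponds to, run against an in-memory SQLite database. The table names, column names and sample dates are assumptions made up for the illustration; the SQL Django actually generates for your models will differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE data (id INTEGER PRIMARY KEY, source_id INTEGER, date TEXT);
    INSERT INTO source VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO data (source_id, date) VALUES
        (1, '2024-01-10'), (1, '2024-01-09'), (1, '2023-12-01'),
        (2, '2024-01-08'),
        (3, '2023-11-15');
""")

one_week_ago = "2024-01-04"  # hypothetical cut-off date

# The WHERE clause restricts the joined data rows first, so COUNT only
# sees recent records. This mirrors filter(...) followed by annotate(...).
rows = conn.execute("""
    SELECT s.name, COUNT(d.id) AS count_data_records
    FROM source AS s
    JOIN data AS d ON d.source_id = s.id
    WHERE d.date >= ?
    GROUP BY s.id
    ORDER BY count_data_records DESC
""", (one_week_ago,)).fetchall()

print(rows)
```

Note that sources without any recent data drop out of the result entirely, because the join plus the date condition leaves no rows for them.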

There is also a way of doing this by mixing Django queries with raw SQL via extra:
import datetime
from django.db.models import Count
start_date = datetime.date.today() - datetime.timedelta(days=7)
activity_per_source = (
    Source.objects
    .extra(where=["(select max(date) from app_data where source_id=app_source.id) >= %s"],
           params=[start_date])
    .annotate(count_data_records=Count('data'))
    .order_by('-count_data_records'))
The where clause keeps only the Sources whose most recent Data date falls within the last week.
Note: replace table and field names with actual ones.

Django Query Optimisation

I am currently working on a telecom analytics project and am new to query optimisation. It takes a full minute to show the result in the browser, even though only 45,000 records have to be accessed. Could you please suggest ways to reduce the time it takes to show results?
I wrote the following query to find the call duration for the people in an age group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
           for i in range(popn)]
for card in card_list:
    dic=Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma+=dic['duration__sum']
avgDur=sigma/popn
The code above runs inside a for loop that iterates over the age groups.
The models are as follows:
class Demo(models.Model):
    card_no=models.CharField(max_length=20,primary_key=True)
    gender=models.IntegerField()
    age=models.IntegerField()
    age_group=models.IntegerField()
class Fact_table(models.Model):
    pri_key=models.BigIntegerField(primary_key=True)
    card_no=models.CharField(max_length=20)
    duration=models.IntegerField()
    time_8bit=models.CharField(max_length=8)
    time_of_day=models.IntegerField()
    isBusinessHr=models.IntegerField()
    Day_of_week=models.IntegerField()
    Day=models.IntegerField()
Thanks

Try this:
from django.db.models import Sum
demo_by_age = Demo.objects.filter(age_group=age)
popn = demo_by_age.count()  # query one
card_list = demo_by_age.values_list('card_no', flat=True)  # query two (runs as a subquery of query three)
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration'))  # query three
sigma = dic['duration__sum']
avgDur = sigma/popn

A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you run, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define one. Django automatically adds an indexed, auto-incrementing primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, related_name='facts')
    ...
The related_name argument allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
from django.db.models import Avg
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    # The two values must be passed to % as a tuple.
    print("Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg']))
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.
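As a rough illustration of what "all in a single query" means, the sketch below runs a comparable JOIN plus GROUP BY against an in-memory SQLite database. The table and column names and the sample rows are assumptions for the example, and the SQL Django generates will not be identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demo (
        id INTEGER PRIMARY KEY,
        card_no TEXT UNIQUE,
        age_group INTEGER
    );
    CREATE TABLE fact (
        id INTEGER PRIMARY KEY,
        card_id INTEGER REFERENCES demo(id),
        duration INTEGER
    );
    INSERT INTO demo VALUES (1, 'A1', 1), (2, 'A2', 1), (3, 'B1', 2);
    INSERT INTO fact (card_id, duration) VALUES
        (1, 10), (1, 20), (2, 30), (3, 40);
""")

# One round trip: join every person to their calls, then average per group.
rows = conn.execute("""
    SELECT d.age_group, AVG(f.duration) AS duration_avg
    FROM demo AS d
    LEFT JOIN fact AS f ON f.card_id = d.id
    GROUP BY d.age_group
    ORDER BY d.age_group
""").fetchall()

for age_group, duration_avg in rows:
    print("Age group: %s - Average duration: %s" % (age_group, duration_avg))
```

One thing to be aware of: this averages over individual calls, whereas the loop in the question divides the total duration by the number of people in the age group. If you need the per-person figure, you would aggregate Sum('facts__duration') and divide by the person count instead.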

How do travel websites implement the sorting of search results?

For example, you search for a hotel in London and get 250 hotels, of which 25 are shown on the first page. On each page the user has the option to sort the hotels by price, name, user reviews etc. The intelligent thing to do would be to fetch only the first 25 hotels from the database for the first page. When the user moves to page 2, make another database query for the next 25 hotels and keep the previous results in cache.
Now consider this: the user is on page 1, sees 25 hotels sorted by price, and then sorts them by user rating. In this case we should keep the hotels we already have in cache and only request the additional hotels. How is that implemented? Is there something built into any language (preferably PHP), or do we have to implement it from scratch using multiple queries?

This is usually done as follows:
The query is executed with order by the required field, and with a top (in some databases limit) set to (page_index + 1) * entries_per_page results. The query returns a random-access rowset (you might also hear of this referred to as a resultset or a recordset depending on the database library you are using) which supports methods such as MoveTo( row_index ) and MoveNext(). So, we execute MoveTo( page_index * entries_per_page ) and then we read and display entries_per_page results. The rowset generally also offers a Count property which we invoke to get the total number of rows that would be fetched by the query if we ever let it run to the end (which of course we don't) so that we can compute and show the user how many pages exist.
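The approach described above can be sketched in Python with SQLite, under some assumptions: the hotel table, its columns and the fetch_page helper are hypothetical, a plain list slice stands in for MoveTo(), and a separate COUNT(*) query stands in for the rowset's Count property.

```python
import sqlite3

SORTABLE_COLUMNS = {"price", "name", "rating"}


def fetch_page(conn, order_by, page_index, entries_per_page):
    """Return (rows for the requested page, total number of pages)."""
    if order_by not in SORTABLE_COLUMNS:
        raise ValueError("cannot sort by %r" % order_by)

    # Ask for just enough rows to reach the end of the requested page.
    limit = (page_index + 1) * entries_per_page
    rows = conn.execute(
        # order_by is checked against a whitelist above; id breaks ties.
        f"SELECT name, price FROM hotel ORDER BY {order_by}, id LIMIT ?",
        (limit,),
    ).fetchall()

    # Skip the earlier pages (the MoveTo step) and keep the current one.
    page = rows[page_index * entries_per_page:]

    total = conn.execute("SELECT COUNT(*) FROM hotel").fetchone()[0]
    page_count = -(-total // entries_per_page)  # ceiling division
    return page, page_count


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hotel (id INTEGER PRIMARY KEY, name TEXT,
                        price INTEGER, rating REAL);
    INSERT INTO hotel VALUES
        (1, 'Alpha', 120, 4.1), (2, 'Bravo', 80, 3.9),
        (3, 'Charlie', 200, 4.8), (4, 'Delta', 95, 4.5),
        (5, 'Echo', 150, 3.2);
""")

page, page_count = fetch_page(conn, "price", page_index=1, entries_per_page=2)
print(page, page_count)
```

Changing the sort column simply means running the query again with a different ORDER BY; the database does the sorting, so nothing has to be re-sorted in application code.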

Is it possible to query a value (or a value set) with cassandra even if we don't know the key (or key range) in advance?

I am going to express the idea in SQL:
SELECT key,value
FROM table1
WHERE value > 10
Or do we always need to know the key?

I suppose you can use secondary indexes, which have been available since version 0.7 of Cassandra.
You might also check out the following answer: Cassandra and Secondary-Indexes, how do they work internally?
It is recommended to use secondary indexes only for low-cardinality columns, i.e. columns that do not have many different values (e.g. columns like 'status' or 'priority', which usually have only a handful of different values such as 'high', 'medium', 'low').
If you are using Hector as your Cassandra client, you can find information on how to use them here:
https://github.com/rantav/hector/wiki/User-Guide

Yes, of course. For example, you can run:
select * from CF where value = 10
If you use the Hector API (e.g. CqlQuery), you can get a list of rows back from this query.
Note that, currently, a query on secondary indexes must have at least one equality condition, so your query with just value > 10 would not work. See this question.

Datastore fetch on two filters alternative?

I have a datastore entity called Game and two fields in it called playerOne and playerTwo. Either of these fields stores a username.
I need to search on the Game entity and return a MAX of 30 games where the username can be either playerOne OR playerTwo...
So in a relational database you would go:
SELECT * FROM Game WHERE playerOne='username' OR playerTwo='username' LIMIT 30
But in big table you can't filter on more than one field! I can't fetch 10 from one and 10 from the other as the number from each can be variable and in createdDate order.
How would you do this in your datastore?

The quick answer is create a StringListProperty that contains [player_a, player_b] and then simply use the multi-value index made out of that:
games = Game.all().filter("players =", player_find)

You cannot do an OR query on the datastore using different fields. If you have to keep your current entity model, then you have to do two queries:
1) filtering on playerOne and limiting to 30
2) filtering on playerTwo and limiting to (30 - result size of query one)
Then merge the results in memory to produce the final set of 30.
If you also want ordering by date, it gets trickier. However, the SQL query you wrote doesn't have any ordering, so I omitted it as well.
If you can change the entity model, a good way to achieve what you want is to have a single field containing a list of both usernames.
Then you can do a simple query in the style of:
SELECT * FROM Game WHERE playerBoth = 'username'
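For the two-query variant with date ordering, one possible way to merge in memory is sketched below in plain Python. It assumes each query is run with the full limit and already returns its newest games; the merge_games helper and the dict keys id and created are made up for the example and are not datastore API.

```python
from operator import itemgetter


def merge_games(as_player_one, as_player_two, limit=30):
    """Merge two query results, drop duplicates, keep the newest `limit`."""
    seen = set()
    merged = []
    for game in as_player_one + as_player_two:
        # A game could appear in both results (e.g. a user playing themselves).
        if game["id"] not in seen:
            seen.add(game["id"])
            merged.append(game)
    merged.sort(key=itemgetter("created"), reverse=True)
    return merged[:limit]


# Stand-ins for the results of the two filtered queries.
as_player_one = [{"id": 1, "created": 5}, {"id": 2, "created": 3}]
as_player_two = [{"id": 3, "created": 4}, {"id": 4, "created": 1}]

newest = merge_games(as_player_one, as_player_two, limit=3)
print([game["id"] for game in newest])
```

Fetching the full limit from both queries costs a few extra reads, but it guarantees that the newest games overall are all present before the merged list is cut down.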
