Row count of a column family in Cassandra - database

Is there a way to get a row count (key count) of a single column family in Cassandra? get_count can only be used to get the column count.
For instance, if I have a column family containing users and want to get the number of users, how could I do it? Each user is its own row.

If you are working on a large data set and are okay with a pretty good approximation, I highly recommend using the command:
nodetool --host <hostname> cfstats
This will dump out a list for each column family looking like this:
Column Family: widgets
SSTable count: 11
Space used (live): 4295810363
Space used (total): 4295810363
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
Memtable Data Size: 150297312
Memtable Switch Count: 434
Read Count: 9716802
Read Latency: 0.036 ms.
Write Count: 9716806
Write Latency: 0.024 ms.
Pending Tasks: 0
Bloom Filter False Postives: 10428
Bloom Filter False Ratio: 1.00000
Bloom Filter Space Used: 18216448
Compacted row minimum size: 771
Compacted row maximum size: 263210
Compacted row mean size: 1634
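The "Number of Keys (estimate)" row is a good guess, and it is much faster than explicit count approaches. Note that nodetool reports statistics for the node it connects to, so on a multi-node cluster the figure covers that node's data rather than the whole cluster.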
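If you want to pull that estimate out of the dump programmatically, a minimal sketch could look like the following. It assumes the output has the same "Column Family:" and "Number of Keys (estimate):" labels as the sample above; the function name and the SAMPLE text are made up for illustration, and other Cassandra versions may format the output differently.

```python
import re


def parse_key_estimates(cfstats_output):
    """Map each column family name to its 'Number of Keys (estimate)' value."""
    estimates = {}
    current_cf = None
    for raw_line in cfstats_output.splitlines():
        line = raw_line.strip()
        cf_match = re.match(r"Column Family:\s*(\S+)", line)
        if cf_match:
            # Remember which column family the following lines belong to.
            current_cf = cf_match.group(1)
            continue
        key_match = re.match(r"Number of Keys \(estimate\):\s*(\d+)", line)
        if key_match and current_cf is not None:
            estimates[current_cf] = int(key_match.group(1))
    return estimates


# Hypothetical excerpt, shaped like the cfstats sample shown above.
SAMPLE = """\
Column Family: widgets
SSTable count: 11
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
"""

print(parse_key_estimates(SAMPLE))
```

In practice you would feed it the captured output of the nodetool command rather than a hard-coded string.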

If you are using an order-preserving partitioner, you can do this with get_range_slice or get_key_range.
If you are not, you will need to store your user ids in a special row.

I found an excellent article on this here: http://www.planetcassandra.org/blog/post/counting-keys-in-cassandra
select count(*) from cf limit 1000000
The above statement can be used if an approximate upper bound is known beforehand. I found this useful for my case.

[Edit: This answer is out of date as of Cassandra 0.8.1 -- please see the Counters entry in the Cassandra Wiki for the correct way to handle Counter Columns in Cassandra.]
I'm new to Cassandra, but I have messed around a lot with Google's App Engine. If no other solution presents itself, you may consider keeping a separate counter in a platform that supports atomic increment operations like memcached. I know that Cassandra is working on atomic counter increment/decrement functionality, but it's not yet ready for prime time.
I can only post one hyperlink because I'm new, so for progress on counter support see the link in my comment below.
Note that this thread suggests ZooKeeper, memcached, and redis as possible solutions. My personal preference would be memcached.
http://www.mail-archive.com/user@cassandra.apache.org/msg03965.html

There is always map/reduce, but that probably goes without saying. If you have that with Hive or Pig, you can do it for any table across the cluster. I am not sure, though, whether the tasktrackers know about Cassandra data locality; if they don't, the whole table may have to be streamed across the network, so that even with tasktrackers running on the Cassandra nodes, the data they receive may come from another Cassandra node :(. I would love to hear if anyone knows for sure.
NOTE: We are setting up map/reduce on cassandra mainly because if we want an index later, we can map/reduce one into cassandra.

I have been getting the counts like this after I convert the data into a hash in PHP.

Related

A field with big array on mongodb

I am a beginner at Mongo and I made a database with the following layout:
Some metadata fields and one field that contains the experiment results.
experiment results: a vector of integers with ~150,000 values
status = db.DataTest.insert_one(
{
"person_num" : num,
"life_cycle" : cycle,
"other_metadata" : meta_data,
"results_of_experiment": big_array
}
)
I inserted about 7,500 of those documents.
They occupy 8 GB of memory, and find operations are really slow.
I don't need to search by the experiment results; I only need to retrieve them from the DB as a chunk of data.
Is there another way to store the experiment results in the DB?
Is GridFS relevant to this case, and not too complicated?
Based on your comments, the most common query is
db.DataTest.find( { "life_cycle": { $gt: 800 } }).limit(5)
Without an index on the life_cycle field, MongoDB is forced to do a collection scan. That is, fetch & evaluate all documents in your collection one by one. In a large collection, this will take a long time.
MongoDB does not create indexes automatically. You would have to observe your most common queries, and create indexes to support those queries. As far as I know, there is no automatic index creation in any database software; SQL, NoSQL, or otherwise.
Database indexing is a deep subject and cannot be explained in a short answer.
Having said that, if you create an index on the life_cycle field, it should improve your query times but only for the query you posted above. Other query types would likely require different indexes. You can do so in the mongo shell:
db.DataTest.createIndex({life_cycle: 1})
I encourage you to read these pages to understand more about indexing in MongoDB:
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/applications/indexes/
https://docs.mongodb.com/manual/tutorial/create-indexes-to-support-queries/

Referential Integrity with Neo4j

I am working on a project that uses a graph database to hold click data for a search engine. The nodes can be search terms or urls, and the edges hold a weight attribute, and a percentage of times that search led to someone clicking that URL.
Number of times the URL was clicked / Number of times term was searched
My issue is that when I update the edges, the percentage will be accurate, but if I later update the search term node and the searched count changes, the edge will no longer have the correct percentage. Is there a way in Neo4j to keep referential integrity, something like a foreign key?
The following info might be helpful.
If you stored the number of clicks instead of the percentage, there is no way to get inconsistent data. For example:
(:Term {id: 1, nSearches: 123})-[:HAS_URL {weight: 2, nClicks: 17}]->(:Url {id: 2})
With this data model, you'd calculate the percentage whenever you needed it.
For example, to find the 10 terms that have the highest percentage of visits to a specific URL:
MATCH (term:Term)-[r:HAS_URL]->(url:Url {id: 2})
RETURN url, term
// toFloat() avoids integer division, which would truncate every ratio below 1 to 0
ORDER BY toFloat(r.nClicks)/term.nSearches DESC
LIMIT 10;
But notice that the inverse query (find the 10 URLs that have the highest percentage of visits from a specific term) does not even require that you calculate the percentage! This is because in this case the percentages all have the same denominator. So, you can just use nClicks for sorting:
MATCH (term:Term {id: 1})-[r:HAS_URL]->(url:Url)
RETURN term, url
ORDER BY r.nClicks DESC
LIMIT 10;
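To make the "same denominator" point concrete, here is a small sketch in Python rather than Cypher. The Edge class, click_rate helper, and the numbers are invented for illustration; only nClicks and nSearches correspond to the properties in the model above.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    url_id: int
    n_clicks: int  # plays the role of nClicks on the HAS_URL relationship


def click_rate(n_clicks, n_searches):
    """Fraction of searches that led to a click; 0.0 if the term was never searched."""
    return n_clicks / n_searches if n_searches else 0.0


# One term (nSearches = 123) and its outgoing edges; values are made up.
n_searches = 123
edges = [Edge(2, 17), Edge(3, 40), Edge(4, 5)]

# Sorting by the derived percentage...
by_rate = sorted(edges, key=lambda e: click_rate(e.n_clicks, n_searches), reverse=True)
# ...gives the same order as sorting by the raw click count,
# because every edge is divided by the same n_searches.
by_clicks = sorted(edges, key=lambda e: e.n_clicks, reverse=True)

print([e.url_id for e in by_rate], [e.url_id for e in by_clicks])
```

Since the percentage is derived on demand, changing the term's search count never leaves a stale value behind.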
Unfortunately no, Neo4j doesn't support this. You can still do it with one of two methods. I'll describe both, then make a recommendation.
Relative to your relational database, I don't think you're looking for a foreign key or "referential integrity" -- I think what you're looking for is more like a trigger. A trigger is like a function or procedure that executes when data changes. In your case, it'd probably be good to have trigger functions that re-calculated all of the weight percentages on incident edges.
Option 1 - The capable Max De Marzi has got you covered there with a description of how you can do triggers in neo4j. Spoiling the surprise, there's a TransactionEventHandler in the java API. When the right kind of transaction comes through, you can catch that and do extra stuff.
Option 2 - the server provides an extension/plugin mechanism so that you could write this on your own. This is a big hammer, it can do just about anything, but it's harder to wield, too.
I'd recommend you look into Max's post and the TransactionEventHandler. You might then implement public void afterCommit(TransactionData transactionData, Object o). In that method, you'd inspect the transaction data to see whether it was something of interest (not all transactions would be). If the transaction updated a search term node or changed its searched count, you'd do your recomputation and fix your weights, and you should be good.

Quickly adding edge counts to a document in ArangoDB

Not too complicated: I want to count the edges of each document and save the number in the document. I've come up with two queries that work; unfortunately since I have millions of edges both are quite slow. Is there a faster way to update documents with a property storing their number of edges? (just a count at a point in time)
AQL queries that are functional but slow:
FOR doc IN Documents
LET inEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc, {direction: 'inbound', maxDepth: 1}))
LET outEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc, {direction: 'outbound', maxDepth: 1}))
UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
or:
FOR e IN Edges
COLLECT docId = e._to WITH COUNT INTO counter
UPDATE SPLIT(docId,'/')[1] WITH {inEdgeCount: counter} IN Documents
(and then repeat for outbound edges)
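For reference, the COLLECT-based query amounts to a single pass over the edge list with one counter per direction. A rough Python sketch of that idea, outside the database, is below; the sample edges and the edge_counts name are invented, and only the _from/_to attribute names come from ArangoDB's edge documents.

```python
from collections import Counter


def edge_counts(edges):
    """Count inbound and outbound edges per document id in one pass."""
    in_counts = Counter()
    out_counts = Counter()
    for edge in edges:
        out_counts[edge["_from"]] += 1  # edge leaves its _from document
        in_counts[edge["_to"]] += 1     # edge arrives at its _to document
    return in_counts, out_counts


# Hypothetical edges between three documents.
edges = [
    {"_from": "Documents/a", "_to": "Documents/b"},
    {"_from": "Documents/a", "_to": "Documents/c"},
    {"_from": "Documents/b", "_to": "Documents/c"},
]

in_counts, out_counts = edge_counts(edges)
print(dict(in_counts), dict(out_counts))
```

Documents with no edges in a given direction simply never appear in that counter, so they would need a default of 0 when the counts are written back.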
As an aside, is there any way to view either query speed (e.g. FOR executions per second) or percentage completion? I've been trying to judge speed by using LIMITed queries to start with, but the time required doesn't seem to scale linearly.
With ArangoDB 2.8 you can use graph pattern matching traversals to execute this with better performance:
FOR doc IN Documents
LET inEdgesCount = LENGTH(FOR v IN 1..1 INBOUND doc GRAPH 'edgeGraph' RETURN 1)
LET outEdgesCount = LENGTH(FOR v IN 1..1 OUTBOUND doc GRAPH 'edgeGraph' RETURN 1)
UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
Currently ArangoDB doesn't have a way to monitor the progress of long-running tasks. With ArangoDB 3.0 we're going to introduce a new monitoring framework that allows better inspection of what's actually going on in the server. However, 3.0 won't be able to gather live statistics; we may see this further down the 3.x road later this year. Judging percentage completion may become possible for easy tasks like creating indices, but for queries it is more likely to be the number of documents read/written so far.
We did similar queries for validating whether a graph obeys a power law

Best databasing practice for tracing non-homogenous events

Suppose I have the following event data scheme:
event_record_unique_id: long
event_timestamp: long
session_id: long
event_id: int
event_data: data # concrete type depends on event_id
... so, the contents of the data may depend on, let's say 500, event_ids, leading to 200 different concrete data types for "data". For example:
{
event_record_unique_id: 17126721
event_timestamp: 1234
session_id: 3452
event_id: 50
event_data: {
user_id: 123
page_id: 789
}
}
{
event_record_unique_id: 1712672123
event_record_unique_id: 17126723
event_timestamp: 1234
session_id: 3454
event_id: 51
event_data: {
user_id: 124
button_id: 789
}
}
{
event_timestamp: 1234
session_id: 3454
event_id: 51
event_data: {
crash_report: "text"
device_id: "12312"
}
}
Also:
many of the event_data attributes appear in many of the concrete event_data objects
I need to perform indexed searches on some of the event_data attributes (e.g. find me all the records where user_id=X )
there's a continuing need to keep on adding event types and new attributes
the above data structure is always trivially flattened, so that a single record can be represented equivalently as a row with N columns (attribute name/type collisions are solved by renaming attributes).
The naive RDBMS approach would involve making ~500 tables (one per concrete type of "data"). I've discounted this approach (= excessive waste of human effort in modelling). Plus, I cannot easily search all records over user_id (since user_id appears in very many tables).
Flattening the structure in an RDBMS is also quite costly (N-8 of the elements are NULL and contain no information).
MongoDB-type document database solutions appear to be a good fit; however, space costs seem quite high if attribute names are held with each record, not much better than an RDBMS. On the other hand, this does allow me to index by fields in the data object.
For me, an ideal data representation of this would be a table that is optimized to allow rows with many null elements (e.g. by keeping an active column bitmask per row), or a document DB in which a document collection maintains a library of document schemas used to enable compacting the data (with each document holding a reference to its schema).
What kind of database would people recommend for the above example case?
MS SQL Server 2008 and up have Sparse Columns. Up to 30,000 can be added in a table, and they can be indexed (filtered indexes are recommended). Or so says BOL; I have not used them myself. This would result in a single very large table that might support what you need.
With that said, I don't know it would be particularly efficient. Some math:
Assume 10 rows a second
becomes 10*60*60*24 = 864,000 rows a day
or 315,360,000 rows a year
with a very rough over-estimate of 50 bytes a row
is about 14GB a year
for how many years do you have to keep the data?
and double that if it's more like 20 rows per second
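The back-of-envelope numbers above are easy to replay for other row rates or row sizes. A minimal sketch, using the same assumptions as the list (10 or 20 rows per second, 50 bytes per row, 365 days); the function name is made up:

```python
SECONDS_PER_DAY = 60 * 60 * 24


def yearly_storage_bytes(rows_per_second, bytes_per_row, days=365):
    """Raw data volume per year, ignoring indexes and storage overhead."""
    rows_per_year = rows_per_second * SECONDS_PER_DAY * days
    return rows_per_year * bytes_per_row


for rate in (10, 20):
    total = yearly_storage_bytes(rate, 50)
    print(f"{rate} rows/s: {total / 1e9:.1f} GB ({total / 2**30:.1f} GiB) per year")
```

At 10 rows per second this comes to 15,768,000,000 bytes, i.e. roughly 15.8 GB or 14.7 GiB a year, before any index or per-row overhead is counted.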
So storage doesn't seem too far out of line... but I don't know; you will want to work up some serious size projection factors. And that's just storage. What do you want or need to do with the data? Is retrieval time for specified rows important? What about analysis and data mining? I'm a SQL guy through and through, and I think it could be done, but this is pretty much the kind of problem that Hadoop and NoSQL solutions were devised for, and it could well be worth your time to thoroughly investigate those options.

Sphinx. How fast are Random results?

Does anybody have experience with getting random results from an index with 100,000,000+ (100 million) records?
The goal is to get 30 results ordered randomly, at least 100 times per second.
My records are actually in MySQL, but selecting with ORDER BY RAND() from huge tables is the easiest way to kill MySQL.
Sphinxsearch or something else: what do you recommend?
I don't have that big an index to try.
barry@server:~/modules/sphinx-2.0.1-beta/api# time php test.php -i gi_stemmed --sortby @random --select id
Query '' retrieved 20 of 3067775 matches in 0.081 sec.
Query stats:
Matches:
<SNIP>
real 0m0.100s
user 0m0.010s
sys 0m0.010s
This is on a reasonably powerful dedicated server - that is serving live queries (~20qps)
But to be honest, if you don't need filtering (i.e. each query having a 'WHERE' clause), you can just set up a system that returns random results; you can do this with MySQL. Just using ORDER BY RAND() is evil (and Sphinx, while better at sorting than MySQL, is still doing basically the same thing).
How 'sparse' is your data? If most of your ids are used, you can just do something like
$ids = array();
$max = getOne("SELECT MAX(id) FROM table");
foreach(range(1,30) as $idx) {
$ids[] = rand(1,$max);
}
$query = "SELECT * FROM table WHERE id IN (".implode(',',$ids).")";
(You may want to use shuffle() in PHP on the results afterwards, as you are likely to get the results out of MySQL in id order.)
This will be much more efficient. If you do have holes, perhaps just look up 33 rows. Sometimes you will get more than you need (just discard the extras), but you should still get 30 most of the time.
(Of course you could cache '$max' somewhere, so it doesn't have to be looked up all the time.)
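The oversampling idea (ask for a few extra ids, drop the holes, shuffle, trim to 30) can be sketched in Python with an in-memory dict standing in for the table. Everything here is illustrative: the table contents, the pick_random_rows name, and the extra parameter are assumptions. Unlike the PHP rand() loop, random.sample never picks the same id twice.

```python
import random


def pick_random_rows(table, max_id, wanted=30, extra=3, rng=random):
    """Pick up to `wanted` random rows by id, tolerating gaps in the id space."""
    # Ask for a few more ids than needed so that missing ids rarely leave us short.
    candidate_ids = rng.sample(range(1, max_id + 1), wanted + extra)
    # Stand-in for: SELECT * FROM table WHERE id IN (...); holes simply match nothing.
    rows = [table[i] for i in candidate_ids if i in table]
    # The database would return rows in id order, so shuffle before trimming.
    rng.shuffle(rows)
    return rows[:wanted]


# Hypothetical table with ids 1..1000 where every 50th id is missing.
table = {i: {"id": i} for i in range(1, 1001) if i % 50 != 0}

rows = pick_random_rows(table, max_id=1000)
print(len(rows), [r["id"] for r in rows[:5]])
```

With a sparser id space you would raise `extra`, or retry when fewer than 30 rows come back.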
Otherwise you could set up a dedicated 'shuffled' list: basically a FIFO buffer, with one thread filling it with random results (perhaps using the above system, with 3000 ids at a time), while the consumers just read random results directly out of this queue.
A FIFO is not particularly easy to implement with MySQL, so maybe use a different system, such as Redis or even just memcache.
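As a rough in-process illustration of the producer/consumer shape, here is a sketch that uses Python's queue.Queue in place of Redis or memcache. It buffers random ids rather than full rows, and the function names, batch size, and max_id are assumptions, not part of any Sphinx or MySQL API.

```python
import queue
import random
import threading


def start_filler(buffer, max_id, batch=3000, rng=random):
    """Start a daemon thread that keeps the FIFO buffer topped up with random ids."""
    def fill():
        while True:
            for _ in range(batch):
                # put() blocks while the buffer is full, so the filler idles
                # until consumers have drained some entries.
                buffer.put(rng.randint(1, max_id))

    thread = threading.Thread(target=fill, daemon=True)
    thread.start()
    return thread


def take(buffer, n=30):
    """Consumer side: read the next n random ids off the front of the queue."""
    return [buffer.get() for _ in range(n)]


buffer = queue.Queue(maxsize=3000)
start_filler(buffer, max_id=100_000_000)
print(take(buffer)[:5])
```

Consumers never compute anything random themselves; they only pop from the queue, which is what keeps the per-request cost low.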

Resources