Best database practice for tracing non-homogeneous events

Suppose I have the following event data schema:
event_record_unique_id: long
event_timestamp: long
session_id: long
event_id: int
event_data: data # concrete type depends on event_id
The contents of event_data depend on the event_id, so with, say, 500 event_ids there may be around 200 different concrete data types for "data". For example:
{
    event_record_unique_id: 17126721
    event_timestamp: 1234
    session_id: 3452
    event_id: 50
    event_data: {
        user_id: 123
        page_id: 789
    }
}
{
    event_record_unique_id: 17126723
    event_timestamp: 1234
    session_id: 3454
    event_id: 51
    event_data: {
        user_id: 124
        button_id: 789
    }
}
{
    event_timestamp: 1234
    session_id: 3454
    event_id: 52
    event_data: {
        crash_report: "text"
        device_id: "12312"
    }
}
Also:
many of the event_data attributes appear in many of the concrete event_data objects
I need to perform indexed searches on some of the event_data attributes (e.g. find me all the records where user_id=X )
there's a continuing need to keep on adding event types and new attributes
the above data structure can always be trivially flattened so that a single record is represented equivalently as a row with N columns (attribute name/type collisions are resolved by renaming attributes).
The naive RDBMS approach would involve making ~500 tables (one per concrete type of "data"). I've discounted this approach (= excessive waste of human effort in modelling). Plus, I cannot easily search all records over user_id (since user_id appears in very many tables).
Flattening the structure in an RDBMS is also quite costly (N-8 of the elements are NULL and contain no information).
MongoDB-type document database solutions appear to be a good fit; however, space costs seem quite high if attribute names are held with each record, which is not much better than an RDBMS. On the plus side, this does allow me to index by fields in the data object.
For me, an ideal data representation would be a table that is optimized for rows with many null elements (e.g. by keeping an active-column bitmask per row), or a document DB in which a document collection maintains a library of document schemas used to compact the data (with each document holding a reference to its schema).
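Purely as an illustration of that second idea (this is not a feature of any particular product, and the layout is made up), the collection could hold each attribute list once, and every record would then carry only a schema reference plus positional values:
schemas: {
    50: ["user_id", "page_id"],
    51: ["user_id", "button_id"]
}
{
    event_record_unique_id: 17126721
    event_timestamp: 1234
    session_id: 3452
    event_id: 50
    schema_ref: 50
    values: [123, 789]
}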
What kind of database would people recommend for the above example case?

MS SQL Server 2008 and up have Sparse Columns. Up to 30,000 can be added to a table, and they can be indexed (filtered indexes are recommended). Or so says BOL; I have not used them myself. This would result in a single very large table that might support what you need.
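To make that concrete, here is a rough sketch of what such a table might look like (a sketch only: the column names are borrowed from the question's examples, and the exact data types are assumptions):
CREATE TABLE dbo.EventRecords (
    event_record_unique_id bigint NOT NULL PRIMARY KEY,
    event_timestamp        bigint NOT NULL,
    session_id             bigint NOT NULL,
    event_id               int    NOT NULL,
    -- one sparse column per distinct event_data attribute; NULL (and nearly free) when unused
    user_id      int           SPARSE NULL,
    page_id      int           SPARSE NULL,
    button_id    int           SPARSE NULL,
    crash_report nvarchar(max) SPARSE NULL,
    device_id    varchar(32)   SPARSE NULL
);
-- filtered index so that only the rows that actually carry user_id are indexed
CREATE NONCLUSTERED INDEX IX_EventRecords_user_id
    ON dbo.EventRecords (user_id)
    WHERE user_id IS NOT NULL;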
With that said, I don't know whether it would be particularly efficient. Some math:
Assume 10 rows a second
becomes 10*60*60*24 = 864,000 rows a day
or 315,360,000 rows a year
with a very rough over-estimate of 50 bytes a row
is about 14GB a year
for how many years do you have to keep the data?
and double that if it's more like 20 rows per second
So storage doesn't seem too far out of line... but I don't know, you'd want to work up some serious size projections. And that's just storage: what do you want or need to do with the data? Is retrieval time for specific rows important? What about analysis and data mining? I'm a SQL guy through and through, and I think it could be done, but this is pretty much the kind of problem that Hadoop and NoSQL solutions were devised for, and it could well be worth your time to thoroughly investigate those options.
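For what it's worth, the projection above is trivial to re-run with your own numbers; a throwaway Python sketch (the 50 bytes per row is the same rough guess as above):
rows_per_second = 10
bytes_per_row = 50      # very rough over-estimate, as above
years_retained = 5      # pick your own retention period
rows_per_year = rows_per_second * 60 * 60 * 24 * 365
total_bytes = rows_per_year * bytes_per_row * years_retained
print(f"{rows_per_year:,} rows/year, ~{total_bytes / 2**30:.1f} GiB over {years_retained} years")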

Related

A field with big array on mongodb

I am a beginner with Mongo and I made a database with the following topology: some fields of metadata and one field that contains the experiment results (a vector of integers with ~150,000 values).
status = db.DataTest.insert_one(
    {
        "person_num": num,
        "life_cycle": cycle,
        "other_metadata": meta_data,
        "results_of_experiment": big_array
    }
)
I inserted something like 7500 of those documents
It occupies about 8GB and find operations are really slow.
I don't need to search by the experiment results, only the option to retrieve them from the DB as a chunk of data.
Is there another solution for storing the experiment results in the DB?
Is using "gridfs" relevant to this case, and would it be too complicated?
Based on your comments, the most common query is
db.DataTest.find( { "life_cycle": { $gt: 800 } }).limit(5)
Without an index on the life_cycle field, MongoDB is forced to do a collection scan. That is, fetch & evaluate all documents in your collection one by one. In a large collection, this will take a long time.
MongoDB does not create indexes automatically. You would have to observe your most common queries, and create indexes to support those queries. As far as I know, there is no automatic index creation in any database software; SQL, NoSQL, or otherwise.
Database indexing is a deep subject and cannot be explained in a short answer.
Having said that, if you create an index on the life_cycle field, it should improve your query times but only for the query you posted above. Other query types would likely require different indexes. You can do so in the mongo shell:
db.DataTest.createIndex({life_cycle: 1})
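If you want to confirm that the index is actually being used, the shell's explain helper shows the chosen plan; you want to see an IXSCAN stage rather than a COLLSCAN (same collection and query as above):
db.DataTest.find({ life_cycle: { $gt: 800 } }).limit(5).explain("executionStats")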
I encourage you to read these pages to understand more about indexing in MongoDB:
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/applications/indexes/
https://docs.mongodb.com/manual/tutorial/create-indexes-to-support-queries/

Azure Search: Order by dynamic data

I have an Azure Search index composed of documents that can "occur" in multiple regions any number of times. For example, Document1 has 5 occurrences in Region1 and 20 occurrences in Region2; Document2 has 54 occurrences in Region1 and 10 occurrences in Region3; Document3 has 10 occurrences in Region3. We want to use Azure Search for searching and suggestions, but base the order on the number of occurrences in a region. For example, a search for "Document" from a user in Region1 should return Document2, Document1, Document3 in that order, because Document2 has 54 occurrences in that region, Document1 has 5, and Document3 has none.
[
{ 'name': 'Document1', 'regions': ['Region1|5', 'Region2|20'] },
{ 'name': 'Document2', 'regions': ['Region1|54', 'Region3|10'] },
{ 'name': 'Document3', 'regions': ['Region3|10'] }
]
I'm having a hard time figuring out how to structure the index, or whether it is even possible with Azure Search. Please note that the number of regions is potentially in the hundreds of thousands. I am OK with swapping regions for center points and using geospatial functions instead, but I still don't see how to lay out the data or query it.
What is the best way to structure the index and how would one make the query possible?
tl;dr - There might be a solution for you based on some assumptions I have. Please read on, and if possible provide some validation of my assumptions so I can give a better answer (if such an answer exists).
Unfortunately, Azure Search doesn't have an out-of-the-box approach for your scenario. There might be a workaround, however: instead of the regions collection being something like ['Region1|5', 'Region2|20'], you could try to structure the document so that it appears as ['Region1', 'Region1', ...., 'Region2', 'Region2', ...] (that is, make the collection contain n elements of Region1 and m elements of Region2, where in your case n = 5 and m = 20).
Then you should simply be able to search using the Region that the user originates from and I believe the results should be ordered based on which document's collection column (regions) contains more occurrences of the particular queried region.
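Purely as an illustration of that reshaping (with ... standing in for the repeated elements rather than writing them all out), the documents from the question would become something like:
[
{ 'name': 'Document1', 'regions': ['Region1', 'Region1', 'Region1', 'Region1', 'Region1', 'Region2', 'Region2', ...] },
{ 'name': 'Document2', 'regions': ['Region1', 'Region1', ..., 'Region3', 'Region3', ...] },
{ 'name': 'Document3', 'regions': ['Region3', 'Region3', ...] }
]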
Two things to consider here:
You could try adding each region as a column in the search index and use some queries to get the kind of result you want. However, since you mention there might be hundreds of thousands of such regions, it might not work well with our service limits. If however that's not the case, I highly recommend adding each region as a column, so that you can query/order by the column value.
With the replication of the string approach, you can have arbitrarily large collections as I believe Azure search does not have any limitations with regard to the number of elements in a collection. Also the nice thing here is, if your document will have a sparse number of regions (i.e., you may have 100s of 1000s of regions, but any given document will only have few regions enumerated), you should be able to achieve what you want. If that's not the case however, this approach might not be super nice/efficient and might even be painful for you to manage.
Also, just FYI I'd recommend taking a look at the scoring profiles feature
and especially the tag function to see if that might in any way be useful to you.
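For reference, a tag-function scoring profile in the index definition looks roughly like the following (a sketch from memory, so check the exact JSON shape against the Azure Search documentation; "regionBoost" and "userRegion" are made-up names). At query time you would pass the user's region through the matching scoringParameter.
"scoringProfiles": [
  {
    "name": "regionBoost",
    "functions": [
      { "type": "tag", "fieldName": "regions", "boost": 5, "tag": { "tagsParameter": "userRegion" } }
    ]
  }
]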

Referential Integrity with Neo4j

I am working on a project that uses a graph database to hold click data for a search engine. The nodes can be search terms or URLs, and the edges hold a weight attribute and the percentage of times that a search led to someone clicking that URL.
Number of times the URL was clicked / Number of times term was searched
My issue is that when I update the edges, the percentage will be accurate, but if I later update the search term node and the searched count changes, the edge will no longer have the correct percentage. Is there a way in Neo4j to keep referential integrity, like a foreign key?
The following info might be helpful.
If you stored the number of clicks instead of the percentage, there would be no way to get inconsistent data. For example:
(:Term {id: 1, nSearches: 123})-[:HAS_URL {weight: 2, nClicks: 17}]->(:Url {id: 2})
With this data model, you'd calculate the percentage whenever you needed it.
For example, to find the 10 terms that have the highest percentage of visits to a specific URL:
MATCH (term:Term)-[r:HAS_URL]->(url:Url {id: 2})
RETURN url, term
// toFloat avoids integer division when computing the ratio
ORDER BY toFloat(r.nClicks) / term.nSearches DESC
LIMIT 10;
But notice that the inverse query (find the 10 URLs that have the highest percentage of visits from a specific term) does not even require that you calculate the percentage! This is because in this case the percentages all have the same denominator. So, you can just use nClicks for sorting:
MATCH (term:Term {id: 1})-[r:HAS_URL]->(url:Url)
RETURN term, url
ORDER BY r.nClicks DESC
LIMIT 10;
Unfortunately no, Neo4j doesn't support this. You can still do it with one of two methods; I'll describe both, then make a recommendation.
Relative to your relational database analogy, I don't think you're looking for a foreign key or "referential integrity"; I think what you're looking for is more like a trigger. A trigger is a function or procedure that executes when data changes. In your case, it would probably be good to have trigger functions that recalculate all of the weight percentages on incident edges.
Option 1 - The capable Max De Marzi has got you covered there with a description of how you can do triggers in neo4j. Spoiling the surprise, there's a TransactionEventHandler in the java API. When the right kind of transaction comes through, you can catch that and do extra stuff.
Option 2 - the server provides an extension/plugin mechanism so that you could write this on your own. This is a big hammer, it can do just about anything, but it's harder to wield, too.
I'd recommend you look into Max's post and the TransactionEventHandler. You might then implement public void afterCommit(TransactionData transactionData, Object o). In that method, you'd check the transaction data to see if it is something of interest (not all transactions will be). If the transaction updated a search term node or changed the searched count, go do your recomputation, fix your weights, and you should be good.
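As a rough sketch of what that might look like against the embedded Java API (nSearches, HAS_URL and recomputeWeights are assumptions based on the data model discussed in the other answer, not anything Neo4j provides):
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.event.PropertyEntry;
import org.neo4j.graphdb.event.TransactionData;
import org.neo4j.graphdb.event.TransactionEventHandler;

public class SearchCountHandler implements TransactionEventHandler<Object> {
    @Override
    public Object beforeCommit(TransactionData data) throws Exception {
        return null; // nothing to veto
    }
    @Override
    public void afterCommit(TransactionData transactionData, Object state) {
        // Only react when an nSearches property was assigned on some node.
        for (PropertyEntry<Node> entry : transactionData.assignedNodeProperties()) {
            if ("nSearches".equals(entry.key())) {
                recomputeWeights(entry.entity()); // hypothetical helper
            }
        }
    }
    @Override
    public void afterRollback(TransactionData data, Object state) {
    }
    private void recomputeWeights(Node termNode) {
        // Recalculate the percentage on each incident HAS_URL relationship here.
    }
}
You would register it with graphDb.registerTransactionEventHandler(new SearchCountHandler()) on the embedded GraphDatabaseService.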

Cassandra Time-Series: Allow Filtering, Buckets, or Other

I know there are many time-series questions on here but mine does not seem to comfortably fit with the given solutions. I am also new to Cassandra so I might be approaching this with the wrong mindset. Bear with me.
I am receiving search data in the form:
datetime_searched, term_used, product_found
and the query I would like to make:
Given a start-date and an end-date, return all term-product pairs that fall in that time window. Initially, the window will be a month long. This may (read: will) change.
For example, given the following data:
2013-11-20 00:00:00, "christmas", "decorated tree"
2014-12-01 20:00:00, "christmas", "wrapping paper"
2014-12-23 15:00:00, "christmas", "decorated tree" (duplicate term-product)
and a query for the time-range 2014-12-01 to 2015-01-01, I would like to be able to retrieve:
"christmas", "wrapping paper"
"christmas", "decorated tree"
My initial approach looked like most examples for time series data:
CREATE TABLE search_terms (
    datetime_searched timestamp,
    term_used text,
    product_found text,
    PRIMARY KEY (term_used, datetime_searched)
);
SELECT term_used, product_found
FROM search_terms
WHERE datetime_searched > [start]
AND datetime_searched < [end];
but this requires me to have secondary indexes and/or allow filtering, which seems to be something I should avoid if I'm only capturing a small percentage of the data being filtered.
My second idea was to create time buckets, but this solution seems to work only if I limit the query to the buckets. It also seems to create hotspots - in my initial case, a month-long hotspot. For example, for daily buckets:
CREATE TABLE search_terms_by_day (
    datetime_searched timestamp,
    day_searched timestamp,
    term_used text,
    product_found text,
    PRIMARY KEY (day_searched, datetime_searched, term_used, product_found)
);
SELECT term_used, product_found
FROM search_terms_by_day
WHERE day_searched=[my limited query's bucket];
So what are my options? Do I constrain my requests to the bucket size, possibly creating many CFs with different bucket sizes, all while creating hotspots; am I forced to use secondary indexes; or is there another option I am unaware of?
Thanks in advance.
Writing this question has helped me sort out some of my problems. I've come up with an alternative solution which I am more-or-less happy with but will need some fine-tuning.
There is the possibility of calculating all of the time buckets we need to access, making a query for each of these buckets with a filter to grab the entries we need.
CREATE TABLE search_terms_by_day_of_year (
    day_searched int, // 1 - 366
    datetime_searched timestamp,
    term_used text,
    product_found text,
    PRIMARY KEY (day_searched, datetime_searched, term_used, product_found)
);
// Make N of these, with a different day_searched
SELECT term_used, product_found
FROM search_terms_by_day_of_year
WHERE day_searched = 51
AND datetime_searched > [start]
AND datetime_searched < [end]
Positives:
Avoids scanning all of the search data
Allows smaller time buckets which in turn reduces our hot spots
Negatives:
Requires logic to determine the partition keys needed (see the sketch after this list)
There will be a hotspot for writes for the period of the bucket (in the above example, one day)
A poor choice of bucket size in relation to the query range will require looking through all buckets, negating any gains.
Multiple queries to the database. The smaller the bucket, the more calls needed.
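To make the first negative concrete, here is a rough sketch of the client-side bucket logic (assuming the DataStax Python driver; the contact point and keyspace name are placeholders):
from datetime import datetime, timedelta
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("search_keyspace")  # placeholder keyspace

def terms_in_range(start, end):
    # Query every day-of-year bucket touched by [start, end] and merge the results.
    query = ("SELECT term_used, product_found "
             "FROM search_terms_by_day_of_year "
             "WHERE day_searched = %s "
             "AND datetime_searched >= %s AND datetime_searched <= %s")
    pairs = set()
    day = start
    while day <= end:
        bucket = day.timetuple().tm_yday  # 1 - 366, the partition key
        for row in session.execute(query, (bucket, start, end)):
            pairs.add((row.term_used, row.product_found))
        day += timedelta(days=1)
    return pairs

pairs = terms_in_range(datetime(2014, 12, 1), datetime(2015, 1, 1))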
Please let me know if there is a better solution to this

Row count of a column family in Cassandra

Is there a way to get a row count (key count) of a single column family in Cassandra? get_count can only be used to get the column count.
For instance, if I have a column family containing users and want to get the number of users, how could I do it? Each user is its own row.
If you are working on a large data set and are okay with a pretty good approximation, I highly recommend using the command:
nodetool --host <hostname> cfstats
This will dump out a list for each column family looking like this:
Column Family: widgets
SSTable count: 11
Space used (live): 4295810363
Space used (total): 4295810363
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
Memtable Data Size: 150297312
Memtable Switch Count: 434
Read Count: 9716802
Read Latency: 0.036 ms.
Write Count: 9716806
Write Latency: 0.024 ms.
Pending Tasks: 0
Bloom Filter False Positives: 10428
Bloom Filter False Ratio: 1.00000
Bloom Filter Space Used: 18216448
Compacted row minimum size: 771
Compacted row maximum size: 263210
Compacted row mean size: 1634
The "Number of Keys (estimate)" row is a good guess across the cluster and the performance is a lot faster than explicit count approaches.
If you are using an order-preserving partitioner, you can do this with get_range_slice or get_key_range.
If you are not, you will need to store your user ids in a special row.
I found an excellent article on this here: http://www.planetcassandra.org/blog/post/counting-keys-in-cassandra
select count(*) from cf limit 1000000
The above statement can be used if an approximate upper bound is known beforehand. I found this useful for my case.
[Edit: This answer is out of date as of Cassandra 0.8.1 -- please see the Counters entry in the Cassandra Wiki for the correct way to handle Counter Columns in Cassandra.]
I'm new to Cassandra, but I have messed around a lot with Google's App Engine. If no other solution presents itself, you may consider keeping a separate counter in a platform that supports atomic increment operations like memcached. I know that Cassandra is working on atomic counter increment/decrement functionality, but it's not yet ready for prime time.
I can only post one hyperlink because I'm new, so for progress on counter support see the link in my comment below.
Note that this thread suggests ZooKeeper, memcached, and redis as possible solutions. My personal preference would be memcached.
http://www.mail-archive.com/user#cassandra.apache.org/msg03965.html
There is always map/reduce, but that probably goes without saying. If you have that with Hive or Pig, then you can do it for any table across the cluster, though I am not sure the tasktrackers know about Cassandra locality; it may have to stream the whole table across the network, so you get task trackers on Cassandra nodes but the data they receive may be from another Cassandra node :(. I would love to hear if anyone knows for sure, though.
NOTE: We are setting up map/reduce on cassandra mainly because if we want an index later, we can map/reduce one into cassandra.
I have been getting the counts like this after I convert the data into a hash in PHP.
