Solr schema design: fitting time-series data

I am trying to fit the following data into Solr to support flexible queries and would like your input on the schema. I have data about users, for example:
contentID (assume uuid),
platform (e.g. website, mobile, etc.),
softwareVersion (e.g. sw1.1, sw2.5, etc.),
regionId (e.g. us144, uk123, etc.)
....
and a few more such fields. This data is partially pre-aggregated (by Hadoop jobs), so let's assume that for "contentID = uuid123 and platform = mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in this format:
timestamp : pre-aggregated data [uniques, total]
Jan 15 [ 12, 4]
Jan 14 [ 4, 3]
Jan 13 [ 8, 7]
... ...
And then I also have less granular data, say "contentID = uuid123 and platform = mobile and softwareVersion = ANY and regionId = ANY" (these values will be larger than in the table above since the granularity is reduced):
timestamp : pre-aggregated data [uniques, total]
Jan 15 [ 100, 40]
Jan 14 [ 45, 30]
... ...
I'll get queries like: for "contentID = uuid123 and platform = mobile", give the sum of 'uniques' for Jan 15 - Jan 13; or for "contentID = uuid123 and platform = mobile and softwareVersion = sw1.2", give the sum of 'total' for Jan 15 - Jan 01.
I was thinking of a simple schema where documents would look like this (first example above):
{
  "contentID": "uuid12349789",
  "platform": "mobile",
  "softwareVersion": "sw1.2",
  "regionId": "ANY",
  "ts": "2017-01-15T01:01:21Z",
  "unique": 12,
  "total": 4
}
The second example from above:
{
  "contentID": "uuid12349789",
  "platform": "mobile",
  "softwareVersion": "ANY",
  "regionId": "ANY",
  "ts": "2017-01-15T01:01:21Z",
  "unique": 100,
  "total": 40
}
Possible optimization:
{
  "contentID": "uuid12349789",
  "platform.mobile.softwareVersion.sw1.2.region.us12": {
    "unique": 12,
    "total": 4
  },
  "platform.mobile.softwareVersion.sw1.2.region.ANY": {
    "unique": 100,
    "total": 40
  },
  "ts": "2017-01-15T01:01:21Z"
}
Challenges: The number of such rows is very large, and it will grow exponentially with every new field. For instance, with the schema suggested above I'll end up storing a new document for each combination of contentID, platform, softwareVersion and regionId. If we throw another field into the document, the number of combinations grows exponentially. I already have more than a billion such combination rows.
I am hoping to get advice from experts on whether:
1. Multiple such fields can fit in the same document for different 'ts' values such that range queries are still possible.
2. The time range (ts) can fit in the same document as a list (to reduce the number of rows). I know multivalued fields don't support complex data types, but perhaps something else can be done with the data/schema to reduce query time and the number of rows.
The number of these rows is very large, certainly more than 1 billion if I go with the schema I was suggesting. What schema would you suggest that fits these query requirements?
FYI: all queries will be exact matches on fields (no partial or tokenized matching), so no analysis of the fields is necessary. And almost all queries are range queries (on ts).

You are trying to store precomputed results for every possible combination of attribute values. That's just too much duplicated data. Instead, store each observation with its attributes as a single data point, just once. That way, if you have 'n' observations and you add an additional attribute, the data grows additively, not exponentially. And if you need data for a certain combination of attributes, you filter/aggregate at query time:
{
  "contentID": "uuid12349789",
  "ts": "2017-01-15T01:01:21Z",
  "observation": 10001,
  "attr-platform": "mobile",
  "attr-softwareVersion": "sw1.2",
  "attr-regionId": "US"
}
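To make the query-time aggregation concrete, here is a minimal sketch (not from the original answer) using Solr's JSON Request API with a JSON Facet aggregation; the Solr URL and collection name are placeholders, and the field names follow the example document above:
# Sketch: filter on the attribute fields, restrict ts to the requested window,
# and let Solr sum the observations at query time instead of pre-computing them.
import requests

SOLR_QUERY_URL = "http://localhost:8983/solr/observations/query"  # hypothetical collection

payload = {
    "query": "contentID:uuid12349789",
    "filter": [
        # {!term} avoids query-parser escaping issues with the hyphenated field names
        "{!term f=attr-platform}mobile",
        "{!term f=attr-softwareVersion}sw1.2",
        "ts:[2017-01-13T00:00:00Z TO 2017-01-15T23:59:59Z]",
    ],
    "limit": 0,  # only the aggregate is needed, not the documents themselves
    "facet": {
        "total_observations": "sum(observation)"
    },
}

response = requests.post(SOLR_QUERY_URL, json=payload)
print(response.json()["facets"]["total_observations"])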

Related

CouchDB showing different record count with pagination?

I wanted to find out how many records CouchDB has for a database, so I hit the API below, which returns JSON that includes the record count.
GET http://localhost:20984/mydb/
JSON:
{
  "db_name": "mydb",
  "update_seq": "25577-g1AAAAFLeJ4mDJtXMoSQMrqYcpacSjLYwGSDA1ACqhyPljpJ7xKF0CU7gcrXYlX6QGI0vtgpWJ4lT6AKIW4tTYLACqwZ0c",
  "sizes": {
    "file": 20881199,
    "external": 11977342,
    "active": 16542736
  },
  "purge_seq": 0,
  "other": {
    "data_size": 11977342
  },
  "doc_del_count": 0,
  "doc_count": 25569,
  "disk_size": 20881199,
  "disk_format_version": 6,
  "data_size": 16542736,
  "compact_running": false,
  "cluster": {
    "q": 8,
    "n": 1,
    "w": 1,
    "r": 1
  },
  "instance_start_time": "0"
}
Here doc_count is 25569, so I assume the total number of records is 25569.
But when I set documents per page to 100 and begin to page through the records, it shows 100 records for the first page, 200 records after two pages, and so on. If I keep going this way it shows me more than 35000 records.
Now my question is: if the total is 25569 records, how is CouchDB showing me more than 35000 records with pagination?
You're right about doc_count, it reports the number of documents in the specified database.
Presuming you're using Fauxton and switching "Documents per page" to 100 there, I was able to reproduce the described behavior: when I repeatedly pressed the next arrow in a short interval, the displayed data and the numbers on the paginator went out of sync. This therefore seems to be a bug in Fauxton.
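As a quick cross-check outside Fauxton (a sketch; the host, port and database name come from the question), the document count reported by the database info endpoint can be compared with the row total from _all_docs:
import requests

base = "http://localhost:20984/mydb"

# doc_count from the database info endpoint shown above
doc_count = requests.get(base).json()["doc_count"]

# total_rows from _all_docs; limit=0 returns no rows but still reports the total
total_rows = requests.get(base + "/_all_docs", params={"limit": 0}).json()["total_rows"]

print(doc_count, total_rows)  # if these agree, the extra rows are a display artifact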

MongoDB design of a database

So, I'm designing the model for the documents that I'll insert into my database and I have a question about the design: is it better to insert many flat documents into my collection, or fewer nested documents?
Example:
sale: {
  store_id: "2",
  vendor_id: "2",
  points: 100
}
sale: {
  store_id: "2",
  vendor_id: "2",
  points: 100
}
sale: {
  store_id: "2",
  vendor_id: "2",
  points: 100
}
sale: {
  store_id: "4",
  vendor_id: "3",
  points: 100
}
sale: {
  store_id: "4",
  vendor_id: "1",
  points: 100
}
So, in this non-nested example, if I have N sales, I'll have N documents inside my collection. But if I nest, my example becomes:
stores: [
  {
    store_id: "2",
    vendors: [
      {
        vendor_id: "2",
        sales: [
          { points: 100 },
          { points: 100 },
          { points: 100 }
        ]
      }
    ]
  },
  {
    store_id: "4",
    vendors: [
      {
        vendor_id: "3",
        sales: [
          { points: 100 }
        ]
      },
      {
        vendor_id: "1",
        sales: [
          { points: 100 }
        ]
      }
    ]
  }
]
In this example, I nest all my sales.
So, my question is: for creating reports and analyzing the data, which one is faster? If I want to see which store sold more, for example, will it be faster to analyze nested documents or flat, one-document-per-sale records?
Thank you in advance.
The answer is pretty simple. If you know there are going to be a lot of sales and the number is not bounded, you should go for a separate collection for sales. MongoDB is designed to perform well even with millions of documents in a collection, but you will run into a lot of issues with deep nesting.
Also, there is a 16 MB document size limit in MongoDB, so eventually your single store document will reach that limit and things will get pretty ugly.
It's quite straightforward: go for a separate collection.
You can also read this blog and it will clear things out for you
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
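As an illustration of the reporting case from the question ("which store sold more"), with a separate sales collection this becomes a single aggregation. A minimal pymongo sketch, where the connection string, database and collection names are assumptions and the field names follow the flat documents above:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
sales = client["mydb"]["sales"]                    # assumed database/collection names

# Sum points per store and sort so the best-selling store comes first.
pipeline = [
    {"$group": {"_id": "$store_id", "total_points": {"$sum": "$points"}}},
    {"$sort": {"total_points": -1}},
]

for row in sales.aggregate(pipeline):
    print(row["_id"], row["total_points"])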

Is it possible to change the scoring profile based on the number of tags?

I have a document with a collection of strings representing the number of times that document appears in a region (tags). For example:
[{
  "id": "A",
  // other properties
  "regions": ["3", "3", "3", "2"] // Appears 3 times in region "3" and once in region "2"
},
{
  "id": "B",
  // other properties
  "regions": ["3", "3", "1"] // Appears twice in region "3" and once in region "1"
}]
I tried using a custom scoring profile of type Tag, but I don't see how to give documents with more regions a better score. In other words, I want document A that appears 3 times in region 3 to show before document B that only appears twice in region 3.
FYI, the reason we chose to represent regions this way is because there are way too many regions and not all documents appear in all regions. More details here
Is this doable? This way or another way?
The tag scoring profile checks for an existence of a tag. If the tag appears multiple times, it has no effect on the score.
I've read your other post here. One solution you could consider (which is not exactly what you want) is to bucket the regions based on count. For example, you'd have a collection of regions where the document shows up less than 10 times, between 10 and 50, and between 50 and 100 (pick the ranges in a way that makes sense for the distribution of region occurrences in your scenario). Your documents would look like this:
{
  "id": "A",
  "regions10": ["3", "2"], // Appears in regions 3 and 2 less than 10 times
  "regions50": ["1"] // Appears in region 1 between 10 and 50 times
}
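As a rough illustration of that bucketing step (a sketch; the field names and thresholds are the ones from this example, and the function is hypothetical, not part of any Azure Search API), the bucketed fields could be computed at indexing time from raw per-region counts:
def bucket_regions(region_counts):
    # region_counts maps a region id to the number of times the document
    # appears in that region, e.g. {"3": 3, "2": 1} for document "A" above.
    fields = {"regions10": [], "regions50": [], "regions100": []}
    for region, count in region_counts.items():
        if count < 10:
            fields["regions10"].append(region)
        elif count < 50:
            fields["regions50"].append(region)
        else:
            fields["regions100"].append(region)
    return fields

print(bucket_regions({"3": 3, "2": 1}))
# {'regions10': ['3', '2'], 'regions50': [], 'regions100': []}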
Then, you could use a Weights scoring profile to boost documents that matched in the higher count regions:
"scoringProfiles": [
{
"name": "boostRegions",
"text": {
"weights": {
"regions10": 1,
"regions50": 2,
"regions100": 3
}
}
}
This is not a good solution if you need strict ordering based on the region count, if you can't precompute the region counts, or if the entire range of values is large (say 0 to 2^31) while the individual buckets need to be small (you'd end up with too many fields).
The problem you have is a data modeling problem. You're trying to retrieve documents based on a property of the document (whether it contains a region in a set of regions), but score/boost the document based on properties of the region, not the document. You'd have to have a document in the index for each document-region pair, with a property holding the number of times the given document appeared in that region.

How to optimize read time in a PostgreSQL database

I have a slight problem with a PostgreSQL database with exploding read times.
Background info:
two tables, both with only 4 columns: uuid (uuid), timestamp (bigint), type (text) and value (double) in one, and values (double[]) in the other. (Yes, I thought about combining them into one table... the decision on that isn't in my hands.)
Given that only a fairly small amount of the held data is needed for each "projectrun", I'm already copying the needed data into tables dedicated to each projectrun. The interesting part starts when I try to read the data:
CREATE TABLE fake_timeseries1
(
    "timestamp" bigint,
    uuid uuid,
    value double precision,
    type text COLLATE pg_catalog."default"
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE fake_timeseries1
    OWNER TO "user";

CREATE INDEX fake_timeseries1_timestamp_idx
    ON fake_timeseries1 USING btree
    ("timestamp")
    TABLESPACE pg_default;

ALTER TABLE fake_timeseries1
    CLUSTER ON fake_timeseries1_timestamp_idx;
From that temporary table I do:
"SELECT * FROM table_name WHERE timestamp BETWEEN ? AND ? ;"
Simple enough, should work rather fast, right? Wrong.
At the moment I'm testing with small batches (only x*40k rows, returning 25% of them).
For 10k rows it takes "only" 6 sec, for 20k already 34 sec, and for 40k rows (out of a mere 160k) it already takes 3 minutes per table... 6 minutes for a mere 6 MB of data. (Yes, we are on a Gb line, so there's probably no bottleneck there.)
I already tried using an index and clustering on timestamp, but that slows it down even more; interestingly, not when creating the temporary tables, but when reading the data.
What could I do to speed up the read process? It needs to be able to read those 10-50k rows in less than 5 minutes (preferably less than 1 minute) from a table that holds not 160k rows, but rather tens of millions.
What could be responsible for a simple Select being as slow as creating the whole table in the first place? (3 mins read vs. 3.5 mins create).
Thank you in advance.
As requested, an EXPLAIN ANALYZE (for 20k rows out of 80k):
"Execution Time": 27.501,
"Planning Time": 0.514,
"Plan": {
"Filter": "((\"timestamp\" >= '1483224970970'::bigint) AND (\"timestamp\" <= '1483232170970'::bigint))",
"Node Type": "Seq Scan",
"Relation Name": "fake_timeseries1",
"Alias": "fake_timeseries1",
"Actual Rows": 79552,
"Rows Removed by Filter": 0,
"Actual Loops": 1
},
"Triggers": []
The real execution time was 34.047 seconds.
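An editorial aside, not part of the original post: when comparing plans before and after changes like the index or CLUSTER, it can help to refresh the planner statistics explicitly and capture the plan with buffer counters. A minimal psycopg2 sketch, with placeholder connection details and the table name and timestamp range taken from the question:
import json
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
conn.autocommit = True

with conn.cursor() as cur:
    # Refresh statistics for the freshly copied per-projectrun table.
    cur.execute("ANALYZE fake_timeseries1;")
    # Capture the plan with timing and buffer usage for comparison.
    cur.execute(
        """
        EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
        SELECT * FROM fake_timeseries1
        WHERE "timestamp" BETWEEN %s AND %s
        """,
        (1483224970970, 1483232170970),
    )
    raw = cur.fetchone()[0]
    plan = raw if isinstance(raw, (list, dict)) else json.loads(raw)
    print(json.dumps(plan, indent=2))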
UPDATE:
I continued testing with different test data sets. The following is an EXPLAIN ANALYZE from a significantly larger test set, where I read only 0.25% of the data... still a seq scan. Does anyone have an idea?
[
{
"Execution Time": 7121.59,
"Planning Time": 0.124,
"Plan": {
"Filter": "((\"timestamp\" >= '1483224200000'::bigint) AND (\"timestamp\" <= '1483233200000'::bigint))",
"Node Type": "Seq Scan",
"Relation Name": "fake_forecast",
"Alias": "fake_forecast",
"Actual Rows": 171859,
"Rows Removed by Filter": 67490381,
"Actual Loops": 1
},
"Triggers": []
}
]
UPDATE: After even more testing on a second PostgreSQL database, it seems that I have somehow hit a hard cap.
Whatever I do, the maximum I can get from those two tables is 3.3k rows per second, and that's only if I hit the sweet spot of requesting 20-80k rows in one large batch, which takes 6 and 24 seconds respectively, even on a DB on my own machine.
Is there nothing that can be done (except better hardware) to speed this up?

Modeling Ranking (scores) in a Non-Relational Database (NoSQL)

I'm using Google App Engine, so I'm using a non-relational database (NoSQL). My question is:
What is the best way to model a ranking of players using their scores?
For example, my players are:
Player { String name, int score}
I want to know the rank (position) of a player and also get the top 10 players, but I'm not sure of the best way to do it.
Thanks.
If your scores are indexed, it's easy to do a datastore query and get players in sorted order.
So if you want the top 10 players, that's pretty trivial.
Getting the ranking for an arbitrary player is really hard. Hard enough that I'd say: avoid it if you can, and if you can't, find a hacky way around it.
For example, if you have 50,000 players and PlayerX is ranked 12,345, the only way to know that is to query all the players and check through each of them, keeping count, until you find PlayerX.
One hack might be to store the player ranking in the player entity, and update it with a cron job that runs once every few hours.
There is a built-in solution in Redis:
First add a few members with a score:
redis> ZADD myzset 1 "one"
(integer) 1
redis> ZADD myzset 2 "two"
(integer) 1
redis> ZADD myzset 3 "three"
(integer) 1
Get the rank of "one":
redis> ZREVRANK myzset "one"
(integer) 2
(Index starts at 0)
And if you want the current order:
redis> ZREVRANGE myzset 0 -1
1) "three"
2) "two"
3) "one"
See ZREVRANGE and ZREVRANK in redis documentation.
A suitable representation of this in JSON would be:
"players" : [
{
"name" : "John",
"score" : 15
},
{
"name" : "Swadq",
"score" : 7
},
{
"name" : "Jane",
"score" : 22
}
]
For examples of how to sort this:
PHP: How to sort an array of associative arrays by value of a given key in PHP?
JavaScript: How to sort an array of associative arrays by value of a given key in PHP?
JavaScript general sorting: http://www.breakingpar.com/bkp/home.nsf/0/87256B280015193F87256C8D00514FA4
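For reference, a minimal Python version of that sort (descending by score, so the top players come first; the list is the one from the JSON above):
players = [
    {"name": "John", "score": 15},
    {"name": "Swadq", "score": 7},
    {"name": "Jane", "score": 22},
]

# Sort by score, highest first, and print each player's 1-based rank.
ranked = sorted(players, key=lambda p: p["score"], reverse=True)
for rank, player in enumerate(ranked, start=1):
    print(rank, player["name"], player["score"])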
You could set up your index.yaml like so:
- kind: Player
properties:
- name: score
direction: ascending
To get a player's rank you just need to make a pass over the players (while keeping a count) and cache the result to speed up further searches for that player.
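A rough sketch of that approach with the legacy App Engine Python ndb and memcache APIs (the Player model mirrors the one in the question; the cache key and expiry are assumptions):
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Player(ndb.Model):
    name = ndb.StringProperty()
    score = ndb.IntegerProperty()

def top_players(limit=10):
    # Uses the score index directly; cheap and fast.
    return Player.query().order(-Player.score).fetch(limit)

def rank_of(player_name):
    # Expensive: walks players in score order while keeping count,
    # so cache the result for a while.
    cache_key = "rank:" + player_name
    cached = memcache.get(cache_key)
    if cached is not None:
        return cached
    for position, player in enumerate(Player.query().order(-Player.score), start=1):
        if player.name == player_name:
            memcache.set(cache_key, position, time=3600)  # cache for an hour
            return position
    return None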
