Extracting a test set from a training dataset - database

I have a dataset (where each data point is a vector of attributes with its corresponding class label). I want to split the dataset into a training set and a testing set. Is there any way to do this automatically?

The typical way to split a modeling data set into training, validation, and test sets is to use a random sample. That is, assign each row a random number between 0 and 1. If you want a 40/30/30 split, then rows where the value is between 0 and 0.4 go in training, between 0.4 and 0.7 in validation, and between 0.7 and 1.0 in test. There is nothing magical about the 40/30/30 split. It just happens to be the default in SAS Enterprise Miner (in fact, I often change it to 60/30/10).
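A minimal sketch of this random-number approach in Python/pandas (the column names and the example data are made up):

import numpy as np
import pandas as pd

def random_split(df, train=0.4, valid=0.3, seed=42):
    """Assign each row a uniform random number and cut at the split boundaries."""
    rng = np.random.default_rng(seed)
    r = rng.random(len(df))                      # one draw in [0, 1) per row
    train_set = df[r < train]
    valid_set = df[(r >= train) & (r < train + valid)]
    test_set = df[r >= train + valid]            # remaining 30%
    return train_set, valid_set, test_set

# Hypothetical data: 1000 rows of attributes plus a class label.
data = pd.DataFrame({"x1": np.random.rand(1000),
                     "x2": np.random.rand(1000),
                     "label": np.random.randint(0, 2, 1000)})
train_set, valid_set, test_set = random_split(data)
print(len(train_set), len(valid_set), len(test_set))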
There are some tweaks and possible improvements to this. If you know there are features important for modeling, such as geography, then you can do a stratified sample. For this, you would sort the data by those columns and then essentially do an "every nth" record sample. I say "essentially", because this is a little more complicated for a 40% split. To handle this, take the records 10 at a time, choose 4 for the training set, 3 for the validation set, and 3 for the test set. Which particular ones you choose from each 10 is not important.
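A sketch of that "every nth" stratified variant, again with a hypothetical column name: sort by the stratification column, then hand out rows within each block of 10 in a 4/3/3 pattern.

import numpy as np
import pandas as pd

def stratified_433_split(df, strat_col):
    """Sort by the stratification column, then assign rows in repeating
    blocks of 10: 4 to training, 3 to validation, 3 to test."""
    ordered = df.sort_values(strat_col).reset_index(drop=True)
    pattern = np.array(["train"] * 4 + ["valid"] * 3 + ["test"] * 3)
    labels = pattern[np.arange(len(ordered)) % 10]
    return (ordered[labels == "train"],
            ordered[labels == "valid"],
            ordered[labels == "test"])

# Usage with a hypothetical 'geography' column:
# train_set, valid_set, test_set = stratified_433_split(data, "geography")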
A bigger issue is when you have hierarchical data, and virtually all data used for modeling is hierarchical. For instance, you might have customer data with a significant number of columns describing the customer's census tract. If these variables are important as predictors, then you might consider sampling at the census tract level rather than the customer level. That is, divide the census tracts into three groups (randomly), so that roughly 40% of customers go to the training set, 30% to the validation set, and 30% to the test set.
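For the hierarchical case, a hedged sketch that assigns whole census tracts (a hypothetical tract_id column) at random, so every customer in a tract lands in the same partition:

import numpy as np
import pandas as pd

def group_split(df, group_col, train=0.4, valid=0.3, seed=7):
    """Randomly assign whole groups (e.g. census tracts) to train/valid/test,
    so all rows of a group end up in the same partition."""
    rng = np.random.default_rng(seed)
    groups = df[group_col].unique()
    r = rng.random(len(groups))                  # one draw per group
    assignment = np.where(r < train, "train",
                          np.where(r < train + valid, "valid", "test"))
    label = df[group_col].map(dict(zip(groups, assignment)))
    return df[label == "train"], df[label == "valid"], df[label == "test"]

# Usage with a hypothetical customer table:
# train_set, valid_set, test_set = group_split(customers, "tract_id")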
You want to be sure that you partition the data so that no record falls into more than one group. If you don't know what training, validation, and test sets are, then I would highly recommend that you get a book on data mining (such as "Data Mining Techniques for Marketing, Sales, and Customer Support, Third Edition" at http://www.amazon.com/Data-Mining-Techniques-Relationship-Management/dp/0470650931/ref=pd_sim_b_5).

Related

How to deal with a dataset with "periods of time" and missing data

I'm working on a dataset whose columns are points in time (e.g. August, September, etc.) and whose rows are different measurements collected at each point.
Apart from that, the data is not clean at all: there is a lot of missing data, and I can't just drop all the rows containing it or fill it in, so my idea was to divide the dataset into 4 smaller ones.
What kind of analysis can be performed on a dataset of this kind? Should I invert columns and rows?
A time-series regression with missing data is a special case within statistical analysis. Simply re-jigging the data set is not the solution.
My understanding is that periodicity analysis and spectral analysis are performed to identify the sinusoid of best fit, i.e. a sine wave is driven through the existing data points, and regression is one approach to identifying that fit.
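As a rough illustration of that idea (fitting a sinusoid to the observed points and reading it off where values are missing), here is a sketch with scipy; the example series and the assumed 12-point seasonal period are hypothetical:

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def sine_model(t, amplitude, period, phase, offset):
    return amplitude * np.sin(2 * np.pi * t / period + phase) + offset

# Hypothetical monthly series with gaps (NaN = missing measurement).
series = pd.Series([5.0, 6.0, np.nan, 7.0, 6.7, np.nan, 5.0, 4.0, np.nan, 3.0, 3.3, 4.0])
t = np.arange(len(series), dtype=float)
observed = series.notna().to_numpy()

# Fit the sinusoid only on the observed points (p0 is a rough initial guess).
params, _ = curve_fit(sine_model, t[observed], series[observed],
                      p0=[2.0, 12.0, 0.0, 5.0], maxfev=5000)

# Use the fitted curve to estimate the missing months.
filled = series.copy()
filled[~observed] = sine_model(t[~observed], *params)
print(filled)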
The same question has previously been raised on Stats Stack Exchange in the context of ARIMA (autoregressive integrated moving average) models. Personally, I am not overawed by this approach, because there will be a specialist solution.
https://stats.stackexchange.com/questions/121414/how-do-i-handle-nonexistent-or-missing-data

database design for large streaming data with minimal latency

Following is the scenario:
Customer places an order.
Order has type: Physical / Downloadable.
Order is placed from: Web / App.
Order is placed from a Location: UK, AUS, etc.
Can have more dimensions in future.
Consider that all of the dimensions change frequently in every order. And the data is quite huge, approximately 1.3 million records per hour.
I want to design this in a way that reports can drill down by any requested dimension for each customer.
Example:
- How many orders of type 'Physical' has Customer 'A' placed from 'AUS'?
- How many orders has Customer 'A' placed in all?
- How many orders of type 'Downloadable' has Customer 'A' placed from 'App'?
etc.
These reports are needed in real time, hence low-latency writes and reads are a must. What NoSQL database would be a good fit? And how can this data be structured so that it can be sliced and diced by any required dimension, as well as by combinations of more than one dimension?
If you need high performance then I would recommend ScyllaDB, which can handle over 1M ops/s per node (on good hardware). It shares its data model with Cassandra, so you can model and query your data using CQL. You can give it a free test drive with just a couple of clicks here.
Regarding modeling: a useful technique is to model around your queries. For each query pattern, prepare a table that serves that query in the most efficient way. With this technique you duplicate data, creating as many tables holding the same data as you have distinct types of queries. Duplicating data comes at a price, so you need to trade off performance and cost depending on your needs. You can read more about it here.
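As a hedged sketch of that one-table-per-query idea with Cassandra/Scylla counters (the keyspace, table, and column names below are made up for illustration):

from cassandra.cluster import Cluster  # pip install cassandra-driver (works with Scylla too)

session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical keyspace

# One counter table per query shape; the same order is counted in both.
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_customer (
        customer_id text PRIMARY KEY,
        order_count counter
    )""")
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_customer_type_location (
        customer_id text,
        order_type  text,     -- Physical / Downloadable
        location    text,     -- UK, AUS, ...
        order_count counter,
        PRIMARY KEY ((customer_id), order_type, location)
    )""")

def record_order(customer_id, order_type, location):
    """Write path: bump the counter in every table that serves a report."""
    session.execute(
        "UPDATE orders_by_customer SET order_count = order_count + 1 "
        "WHERE customer_id = %s", (customer_id,))
    session.execute(
        "UPDATE orders_by_customer_type_location SET order_count = order_count + 1 "
        "WHERE customer_id = %s AND order_type = %s AND location = %s",
        (customer_id, order_type, location))

# Read path: "how many Physical orders did customer A place from AUS?"
row = session.execute(
    "SELECT order_count FROM orders_by_customer_type_location "
    "WHERE customer_id = %s AND order_type = %s AND location = %s",
    ("A", "Physical", "AUS")).one()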

Data Store Design for NxN Data Aggregation

I am trying to come up with a theoretical solution to an NxN problem for data aggregation and storage. As an example I have a huge amount of data that comes in via a stream. The stream sends the data in points. Each point has 5 dimensions:
Location
Date
Time
Name
Statistics
This data then needs to be aggregated and stored to allow another user to come along and query the data for both location and time. The user should be able to query like the following (pseudo-code):
Show me aggregated statistics for Location 1,2,3,4,....N between Dates 01/01/2011 and 01/03/2011 between times 11am and 4pm
Unfortunately due to the scale of the data it is not possible to aggregate all this data from the points on the fly and so aggregation prior to this needs to be done. As you can see though there are multiple dimensions that the data could be aggregated on.
They can query for any number of days or locations and so finding all the combinations would require huge pre-aggregation:
Record for Locations 1 Today
Record for Locations 1,2 Today
Record for Locations 1,3 Today
Record for Locations 1,2,3 Today
etc... up to N
Preprocessing all of these combinations prior to querying could result in an amount of processing that is not viable. If we have 200 different locations then we have 2^200 combinations, which would be nearly impossible to precompute in any reasonable amount of time.
I did think about creating records on one dimension and then merging on the fly when requested, but this would also take time at scale.
Questions:
How should I go about choosing the right dimension and/or combination of dimensions given that the user is as likely to query on all dimensions?
Are there any case studies I could refer to, books I could read or anything else you can think of that would help?
Thank you for your time.
EDIT 1
When I say aggregating the data together I mean combining the statistics and name (dimensions 4 & 5) for the other dimensions. So for example if I request data for Locations 1,2,3,4..N then I must merge the statistics and counts of name together for those N Locations before serving it up to the user.
Similarly, if I request the data for dates 01/01/2015 - 01/12/2015 then I must aggregate all data between those periods (by summing the name counts/statistics).
Finally, if I ask for data between dates 01/01/2015 - 01/12/2015 for Locations 1,2,3,4..N then I must aggregate all data between those dates for all those locations.
For the sake of this example, let's say that going through the statistics requires some sort of nested loop and does not scale well, especially on the fly.
Try a time-series database!
From your description it seems that your data is a time-series dataset.
The user seems to be mostly concerned about the time when querying and after selecting a time frame, the user will refine the results by additional conditions.
With this in mind, I suggest you try a time-series database like InfluxDB or OpenTSDB.
For example, Influx provides a query language that is capable of handling queries like the following, which comes quite close to what you are trying to achieve:
SELECT count(location) FROM events
WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
GROUP BY time(10m);
I am not sure what you mean by scale, but the time-series DBs have been designed to be fast for lots of data points.
I'd definitely suggest giving them a try before rolling your own solution!
Denormalization is a means of addressing performance or scalability in relational databases.
IMO having some new tables to hold aggregated data and using them for reporting will help you.
I have a huge amount of data that comes in via a stream. The stream sends the data in points.
There are multiple ways to achieve denormalization in this case:
Adding a new parallel endpoint for data-aggregation functionality at the streaming level
Scheduling a job to aggregate data at the DBMS level
Using the DBMS triggering mechanism (less efficient)
In an ideal scenario, when a message reaches the streaming level, two copies of the data message (containing the location, date, time, name, and statistics dimensions) are dispatched for processing: one goes to OLTP (the current application logic), the second goes to an OLAP (BI) process.
The BI process will create denormalized, aggregated structures for reporting.
I suggest keeping an aggregated data record per (location, date) group.
The end user will then query preprocessed data that doesn't need heavy recalculation, with some acceptable inaccuracy.
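A minimal sketch of that fan-out, assuming a simple in-process accumulator; the field names and handlers are illustrative only:

from collections import defaultdict

# Running OLAP aggregates keyed by (location, date); each value holds
# a point count and per-name statistic totals.
olap_aggregates = defaultdict(lambda: {"count": 0, "stats_by_name": defaultdict(float)})

def handle_oltp(point):
    """Placeholder for the existing application logic (store the raw point, etc.)."""
    pass

def handle_olap(point):
    """Fold the point into the denormalized (location, date) aggregate."""
    bucket = olap_aggregates[(point["location"], point["date"])]
    bucket["count"] += 1
    bucket["stats_by_name"][point["name"]] += point["statistics"]

def on_message(point):
    """Streaming level: dispatch two copies of each message."""
    handle_oltp(point)
    handle_olap(point)

on_message({"location": "London", "date": "2011-01-01", "time": "11:05",
            "name": "donut", "statistics": 3.5})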
How should I go about choosing the right dimension and/or combination of dimensions given that the user is as likely to query on all dimensions?
That will depend on your application logic. If possible, limit the user to predefined queries whose values the user can set (like dates from 01/01/2015 to 01/12/2015). In more complex systems, a report generator on top of the BI warehouse is an option.
I'd recommend Kimball's The Data Warehouse ETL Toolkit.
You can at least reduce Date and Time to a single dimension, and pre-aggregate your data based on your minimum granularity, e.g. 1-second or 1-minute resolution. It could be useful to cache and chunk your incoming stream for the same resolution, e.g. append totals to the datastore every second instead of updating for every point.
What's the size and likelihood of change of the name and location domains? Is there any relation between them? You said that location could be as many as 200 values. I'm thinking that if name is a very small set and unlikely to change, you could hold counts of names in per-name columns in a single record, reducing the scale of the table to 1 row per location per unit of time.
You have a lot of data. Any method will take a lot of time because of the amount of data you're trying to process.
I have two methods to suggest.
The first one is a brute-force one, which you probably already thought of:
id | location | date | time | name | statistics
0 | blablabl | blab | blbl | blab | blablablab
1 | blablabl | blab | blbl | blab | blablablab
etc.
With this one, you can easily look up and get elements; they are all in the same table, but the scan is long and the table is enormous.
The second one is better, I think:
Multiple tables:
id | location
0 | blablabl
id | date
0 | blab
id | time
0 | blab
id | name
0 | blab
id | statistics
0 | blablablab
With this you could look things up (a lot) faster, getting the IDs and then fetching all the needed information.
It also allows you to pre-sort all the data:
You can have the locations sorted by location, the times sorted by time, the names sorted alphabetically, etc., because we don't care about how the IDs are mixed:
If the IDs are 1 2 3 or 1 3 2, no one actually cares, and lookups are a lot faster if your data is already sorted in its respective tables.
So, if you use the second method I gave: at the moment you receive a point of data, give an ID to each of its columns:
You receive:
London 12/12/12 02:23:32 donut verygoodstatsblablabla
You attach the ID to each part of it and store them in their respective tables:
42 | London ==> goes with London location in the location table
42 | 12/12/12 ==> goes with 12/12/12 dates in the date table
42 | ...
With this, if you want all the London data, it is all side by side: you just have to take all the IDs and fetch the other data with them. If you want all the data between 11/11/11 and 12/12/12, it is likewise side by side; you just have to take the IDs, etc.
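If it helps, here is a small runnable sketch of that second layout using SQLite (the table and column names are made up, and the indexes stand in for the "already sorted" idea):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One narrow table per dimension, all sharing the point id.
for table, column in [("locations", "location"), ("dates", "date"),
                      ("times", "time"), ("names", "name"),
                      ("statistics", "statistics")]:
    cur.execute(f"CREATE TABLE {table} (id INTEGER, {column} TEXT)")
    cur.execute(f"CREATE INDEX idx_{table} ON {table} ({column})")

def insert_point(point_id, location, date, time, name, statistics):
    cur.execute("INSERT INTO locations VALUES (?, ?)", (point_id, location))
    cur.execute("INSERT INTO dates VALUES (?, ?)", (point_id, date))
    cur.execute("INSERT INTO times VALUES (?, ?)", (point_id, time))
    cur.execute("INSERT INTO names VALUES (?, ?)", (point_id, name))
    cur.execute("INSERT INTO statistics VALUES (?, ?)", (point_id, statistics))

insert_point(42, "London", "12/12/12", "02:23:32", "donut", "verygoodstats")

# "All the London data": grab the ids from the location table, then join.
cur.execute("""SELECT n.id, n.name, s.statistics
               FROM locations l
               JOIN names n ON n.id = l.id
               JOIN statistics s ON s.id = l.id
               WHERE l.location = 'London'""")
print(cur.fetchall())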
Hope I helped; sorry for my poor English.
You should check out Apache Flume and Hadoop
http://hortonworks.com/hadoop/flume/#tutorials
The Flume agent can be used to capture and aggregate the data into HDFS, and you can scale this as needed. Once it is in HDFS there are many options to visualize it, and you can even use MapReduce or Elasticsearch to view the data sets you are looking for in the examples provided.
I have worked with a point-of-sale database with a hundred thousand products and ten thousand stores (typically week-level aggregated sales, but also receipt-level data for basket analysis, cross-sales, etc.). I would suggest you have a look at these:
Amazon Redshift: highly scalable, relatively simple to get started with, cost-efficient
Microsoft Columnstore Indexes: compresses data and has a familiar SQL interface, but quite expensive (a 1-year reserved r3.2xlarge instance at AWS is about 37,000 USD); no experience on how it scales within a cluster
Elasticsearch: my personal favourite; highly scalable, very efficient searches via inverted indexes, a nice aggregation framework, no license fees, has its own query language but simple queries are simple to express
In my experiments Elasticsearch was 20-50% faster than Microsoft's column store or clustered index tables for small and medium-size queries on the same hardware. To get fast response times you must have a sufficient amount of RAM so that the necessary data structures are loaded in memory.
I know I'm missing many other DB engines and platforms but I am most familiar with these. I have also used Apache Spark but not in data aggregation context but for distributed mathematical model training.
Is there really likely to be a way of doing this without brute forcing it in some way?
I'm only familiar with relational databases, and I think that the only real way to tackle this is with a flat table as suggested before i.e. all your datapoints as fields in a single table. I guess that you just have to decide how to do this, and how to optimize it.
Unless you have to maintain 100% single-record accuracy, I think the question really needs to be: what can we throw away?
I think my approach would be to:
Work out what the smallest time fragment would be and quantise the time domain on that, e.g. each analysable record is 15 minutes long.
Collect raw records into a raw table as they come in, but as the quantising window passes, summarize the rows into the analytical table (for the 15-minute window).
Deletion of old raw records can be done by a less time-sensitive routine.
Location looks like a restricted set, so use a table to convert these to integers.
Index all the columns in the summary table.
Run queries.
Obviously I'm betting that quantising the time domain in this way is acceptable. You could supply interactive drill-down by querying back onto the raw data by time domain too, but that would still be slow.
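A hedged sketch of the first two steps with pandas; the column names, the example points, and the 15-minute window are assumptions:

import pandas as pd

# Raw points as they arrive from the stream (hypothetical example data).
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2011-01-01 11:02", "2011-01-01 11:09",
                                 "2011-01-01 11:21", "2011-01-01 11:55"]),
    "location": ["1", "1", "2", "1"],
    "name": ["donut", "donut", "bagel", "donut"],
    "statistics": [3.0, 1.5, 2.0, 4.0],
})

# Quantise the time domain to 15-minute windows and summarize per
# (window, location, name) into the analytical table.
summary = (raw
           .set_index("timestamp")
           .groupby([pd.Grouper(freq="15min"), "location", "name"])["statistics"]
           .agg(["sum", "count"])
           .reset_index())
print(summary)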
Hope this helps.
Mark

Performance of 100M Row Table (Oracle 11g)

We are designing a table for ad-hoc analysis that will capture umpteen value fields over time for claims received. The table structure is essentially (pseudo-ish-code):
table_huge (
claim_key int not null,
valuation_date_key int not null,
value_1 some_number_type,
value_2 some_number_type,
[etc...],
constraint pk_huge primary key (claim_key, valuation_date_key)
);
All value fields are numeric. The requirements are: the table shall capture a minimum of 12 recent years (hopefully more) of incepted claims. Each claim shall have a valuation date for each month-end occurring between claim inception and the current date. Typical claim inception volumes range from 50k-100k per year.
Adding all this up I project a table with a row count on the order of 100 million, and could grow to as much as 500 million over years depending on the business's needs. The table will be rebuilt each month. Consumers will select only. Other than a monthly refresh, no updates, inserts or deletes will occur.
I am coming at this from the business (consumer) side, but I have an interest in mitigating the IT cost while preserving the analytical value of this table. We are not overwhelmingly concerned about quick returns from the Table, but will occasionally need to throw a couple dozen queries at it and get all results in a day or three.
For argument's sake, let's assume the technology stack is, I dunno, in the 80th percentile of modern hardware.
The questions I have are:
Is there a point at which the cost-to-benefit of indices becomes excessive, considering a low frequency of queries against high-volume tables?
Does the SO community have experience with 100M+ row tables and can it offer tips on how to manage them?
Do I leave the database technology problem to IT to solve, or should I seriously consider curbing the business requirements (and why?)?
I know these are somewhat soft questions, and I hope readers appreciate this is not a proposition I can test before building.
Please let me know if any clarifications are needed. Thanks for reading!
First of all: expect this to "just work" if you leave the tech problem to IT, especially if your budget allows for an "80% current" hardware level.
I do have experience with 200M+ rows in MySQL on entry-level and outdated hardware, and I was always positively surprised.
Some Hints:
On the monthly refresh, load the table without non-primary indices, then create them. Search for the sweet spot: how many index creations in parallel work best. In a project with much less data (ca. 10M rows) this reduced load time by 70% compared to the naive "create table, then load data" approach.
Try to get a grip on the number and complexity of concurrent queries: this has an influence on your hardware decisions (less concurrency = less IO, more CPU).
Assuming you have 20 numeric fields of 64 bits each, times 200M rows: if I calculate correctly, this is a payload of 32 GB. Trade cheap disks against 64 GB of RAM and never have an IO bottleneck.
Make sure you set the tablespace to read only.
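A rough sketch of the first and last hints using python-oracledb and plain Oracle DDL; the connection details, index name, and tablespace name are placeholders, not from the question:

import oracledb  # pip install oracledb

conn = oracledb.connect(user="analytics", password="...", dsn="dbhost/orclpdb")
cur = conn.cursor()

# Monthly refresh: drop secondary indexes, bulk-load, then rebuild them
# (the primary key stays in place).
cur.execute("DROP INDEX ix_huge_valuation_date")
# ... bulk load table_huge here (SQL*Loader, external table, INSERT /*+ APPEND */ ...) ...
cur.execute("CREATE INDEX ix_huge_valuation_date ON table_huge (valuation_date_key) PARALLEL 4")

# Consumers only read, so freeze the tablespace after the refresh.
cur.execute("ALTER TABLESPACE ts_claims READ ONLY")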
You could consider the anchor modeling approach to store only changes.
Considering that ~95% of rows are expected to be repeats, bringing the row count from 100M down to only 5M removes most of your concerns. At that point it is mostly a cache consideration; if the whole table can somehow fit into cache, things happen fairly fast.
For "low" data volumes, the following structure is slower to query than a plain table; at one point (as data volume grows) it becomes faster. That point depends on several factors, but it may be easy to test. Take a look at this white-paper about anchor modeling -- see graphs on page 10.
In terms of anchor-modeling, it is equivalent to
The modeling tool has automatic code generation, but it seems that it currently fully supports only MS SQL Server, though Oracle is in the drop-down too. It can still be used as a code helper.
In terms of supporting code, you will need (at minimum):
Latest perspective view (auto-generated)
Point in time function (auto-generated)
Staging table from which this structure will be loaded (see tutorial for data-warehouse-loading)
Loading function, from staging table to the structure
Pruning functions for each attribute, to remove any repeating values
It is easy to create all this by following auto-generated-code patterns.
With no ongoing updates/inserts, an index NEVER has negative performance consequences, only positive (by MANY orders of magnitude for tables of this size).
More critically, the schema is seriously flawed. What you want is
Claim
claim_key
valuation_date
ClaimValue
claim_key (fk->Claim.claim_key)
value_key
value
This is much more space-efficient as it stores only the values you actually have, and does not require schema changes when the number of values for a single row exceeds the number of columns you have allocated.
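In DDL terms, one way to read that suggestion is the following (sketched with SQLite purely so it runs anywhere; the names and types are illustrative, the real tables would be Oracle):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE claim (
        claim_key       INTEGER PRIMARY KEY,
        valuation_date  INTEGER NOT NULL
    );

    -- One row per value actually present, instead of a fixed column per value.
    CREATE TABLE claim_value (
        claim_key   INTEGER NOT NULL REFERENCES claim (claim_key),
        value_key   INTEGER NOT NULL,
        value       NUMERIC,
        PRIMARY KEY (claim_key, value_key)
    );
""")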
Using the partitioning concept and applying the partition key in every query that you perform will give you better performance.
In our company we solved a huge number of performance issues with partitioning.
One more design suggestion: if you know the table is going to be very, very big, try not to apply many constraints on the table (handle them in the application logic before you insert), and don't have too many columns on the table, to avoid row-chaining issues.

Even partitioning of nonuniform ranged data in cassandra

I've got a rather tricky one, so bear with me as I try not to stumble over my words here. I'm doing some research, and my group is transitioning to a Cassandra database. Our research used MySQL before, but the data outgrew the database (192 million rows in memory at 16 GB -- it was the only way to query the data fast enough). The data itself is kinda-sorta static. There's a whole lot of it, but any new data is a somewhat slow trickle at this point.
The data consists of a boatload of classifier-score pairs. We formulate queries for the database which basically say, "give me the top 500 for the following classifiers". Then the database returns that many scores. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row consisting of a classifier ID and a score -- i.e. [4, 9100]). The scores themselves are non-uniform (the distribution tends to clump toward one end of the values, which by the way are from -10000 to 10000).
As we transition to Cassandra, there are a number of requirements. First of all, we need to be able to query for the top and bottom N scores on a per-classifier basis. Normally I can see that an ordered partitioner would be appropriate for this; however, like I said, the scores tend to clump at the extremes (which would put too much of a burden on one node). So my first question is: how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N?
There is a secondary requirement which pretty much screws up the first one. Sometimes it is necessary to find all scores that are near another score. So if I see classifier 6 with a score of 400, I might ask: show me the 500 scores that are closest to that (all within classifier 6). I'm absolutely stumped about this one. I've read that Cassandra supports secondary indices (yay), but only hash type (boo -- no ranges). Do we create a separate column family for this use case?
And finally, speed is paramount. The data is being used in an interactive GUI application. Ideally, queries should only take a few seconds. And if data all gets stuck on one particular node, it will slow things down.
We've tried all kinds of clever tricks. Our best idea was to put the data into buckets, so that the top 500 went into bucket 1, the next 500 went into bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. Also all of the data WOULD be evenly distributed using a random partitioner. However since MOST of our queries are interested only in bucket 1, it would put a lot of burden on just one node (remember, if N classifiers are involved, it's actually 500 * N scores per bucket). The real disadvantage of this scheme is that it falls apart when we need to query based on nearness to a score (we'd have to do some kind of weird binary search over the buckets to find our starting value).
At this point we're running low on ideas. Everything I've seen about Cassandra makes me wonder if it's even appropriate for this task. We chose it mainly because of its horizontal scalability, which is important (much easier to add a node than to shard an RDBMS). So I suppose my overall question is: how would you approach this? If Cassandra, please address any of the above issues. Otherwise, any insight or wisdom would be appreciated. Thanks.
Why not store the classifier as the column family row key and the score in the column name? Since columns are sorted, it is really fast to query the top/bottom 500 columns for a given classifier. The second type of query is also possible: when you are looking for the scores near s you can, for instance, select the 500 columns before s and the 500 columns after s, and then filter down to the 500 columns nearest s.
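In CQL terms (post-Thrift Cassandra), that layout would look roughly like this; the keyspace and the item_id column are assumptions added to keep (classifier, score) pairs unique:

from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("research")  # hypothetical keyspace

# Classifier is the partition (row) key; score is a clustering column, so
# each partition is stored sorted by score.
session.execute("""
    CREATE TABLE IF NOT EXISTS scores_by_classifier (
        classifier_id int,
        score         int,
        item_id       uuid,
        PRIMARY KEY ((classifier_id), score, item_id)
    ) WITH CLUSTERING ORDER BY (score DESC, item_id ASC)""")

# Top 500 for classifier 6 (clustering order is already descending).
top = session.execute(
    "SELECT score, item_id FROM scores_by_classifier "
    "WHERE classifier_id = 6 LIMIT 500")

# Bottom 500: reverse the clustering order at query time.
bottom = session.execute(
    "SELECT score, item_id FROM scores_by_classifier "
    "WHERE classifier_id = 6 ORDER BY score ASC, item_id DESC LIMIT 500")

# Scores near 400: a slice on the clustering column, trimmed client-side to the nearest 500.
near = session.execute(
    "SELECT score, item_id FROM scores_by_classifier "
    "WHERE classifier_id = 6 AND score >= 300 AND score <= 500")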
