I am trying to understand how can I split my data into clusters using unsupervised learning. For example, k-means method.
I have 20 columns of data and how can it be projected on 2D surface without losing of necessary information from 18 columns?
What should I use to do that?
Any help will be appreciated.
If you are simply interested in viewing your data in 2 dimensions, consider using t-SNE. The scikit-learn python package has a great implementation you can use. However, just remember that you shouldn't cluster your data on the t-SNE output, as the space your data resides in gets sufficiently distorted in the process (only short distances are maintained, whereas longer distances are heavily altered to be either shorter or longer)
Related
I am working on an Android application that uses the Overpass API at [1]. My goal is to get all circular ways that enclose a certain lat-long point.
In order to do so I build a request for a rectangle that contains my location, then parse the response XML and run a ray-casting algorithm to filter the ways that enclose the given lat-long position. This is too slow for the purpose of my application because sometimes the response has tens or hundreds of MB.
Is there any OSM API that I can call to get all ways that enclose a certain location? Otherwise, how could I optimize the process?
Thanks!
[1] http://overpass-api.de/
To my knowledge, there is no standard API in OSM to do this (it is indeed a very uncommon usecase).
I assume you define enclose as the point representing the current location is inside the inner area of the polygon. Furthermore I assume optimizing the process might including changing the entire concept of the algorithm.
First of all, you need to define the rectangle to fetch data. For that, you need to consider that querying a too large rectangle would yield too much data. As far as I know there is no specific API to query circular ways only, and even if there is, querying a too large rectangle would probably denied by the server, because the server load would be enormous.
Server-side precomputation / prefiltering
Therefore I suggest the first optimization: Instead of querying an API that is not specifically suited for your purpose, use an offline database saved on the Android device. OsmAnd and others save the whole database for a country offline, but in your specific usecase you only need to save a pre-filtered database of circular ways.
As far as I know, only a small fraction of the ways in OSM is circular. Therefore I suggest writing a script that regularly downloads OSM dumps e.g. from Geofabrik, remove non-circular ways (e.g. you could check if the last node ID in a way is equal to the first node ID, but you'd need to check if that captures any way you would define as circular). How often you would run it depends on your usecase.
This optimization solves:
The issue of downloading a large amount of data
The issue of overloading the API with large request
The issue of not being able to request large chunks of data
If that is not suitable for your usecase, I suggest to build a simple API for that on your server.
Re-chunking the data into appriopriate grids
However, you still would need to filter a large amount of data. In order to partially solve this, I suggest the second optimization: Re-chunk your data. For example, if your current location is in Virginia, you would not need to filter circular ways that have an area not beyond Texas. Because filtering by state etc. would by highly country-dependent and difficult (CPU-intensive), I suggest to choose a grid, say e.g. 0.05 lat/lon degree (I'd choose a equirectangular projection because it's easy to calculate if you already have lat/lon coordinates).
The script that preprocessed that data shall then create one chunk of data (that could be a file, but we don't know enough about your usecase to talk about specific data strucutres) for any rectangle in the area you want to use. A circular way is included in this chunk if and only if it has at least one node that is inside the chunk area.
You would then only request / filter the specific chunk your position is currently in. Choose the chunk size appropriately for your application (preferably rather small, but that depends on numerous factors!).
This optimization solves:
Assuming most of the circular ways are quite small in terms of their bounding rectangles, you only need to filter a tiny fraction of the overall ways
IO is minimized, especially if you
Hysteretic heuristics
If the aforementioned optimizations do not sufficiently reduce your computation time, I'd suggest the third optimization that depends on how many circular ways you want to find (if you really need to find all, it won't help at all): Use hysteresis. Save the circular ways you were inside of during the last computation (assuming the new current location is near to the last location) and check them first. If your location didn't change too much, you have a high chance of hitting a way you're inside of during the first few raycasts.
Leveraging relations between different circular ways
Also, a fourth optimization is possible: There will be some circular ways that are fully enclosed in another circular way. You could code your program so that it knows about that relation and checks the inner circular way first. If this check succeeds, you automatically now that the current position is also contained in the outer circular way. I think computing the information (server-side) could be incredibly CPU-intensive and implementing it might also be a hard task, so I'd suggest to use this optimization only if not avoidable.
Tuning the parameters of these optimizations should be sufficient to decrease the CPU time needed for your computation significantly. Please feel free to comment/ask if you have further questions regarding these suggestions.
I am a researcher and my primary interest is improving sparse kernels for high performance computing. I investigate large number of parameters on many sparse matrices. I wonder whether there is a tool to manage these results. The problems that I encounter are:
Combine results of several experiments for each matrix
Version the results
Taking average, finding minimum/maximum/standard deviation of results
There are hundreds of metrics that describe the performance improvement. I want to select a couple of the metrics easily and try to find which metric correlates with the performance improvement.
Here I gave a sample small instance of my huge problem. There are three types of parameters and two values for each parameter: Row/Column, Cyclic/Block, HeuristicA/HeuristicB. So there must 8 files for the combination of these parameters. Contents of two of them:
Contents of the file RowCyclicHeuristicA.txt
a.mtx#3#5.1#10#2%#row#cyclic#heuristicA#1
a.mtx#7#4.1#10#4%#row#cyclic#heuristicA#2
b.mtx#4#6.1#10#3%#row#cyclic#heuristicA#1
b.mtx#12#5.7#10#7%#row#cyclic#heuristicA#2
b.mtx#9#3.1#10#10%#row#cyclic#heuristicA#3
Contents of the file ColumnCyclicHeuristicA.txt
a.mtx#3#5.1#10#5%#column#cyclic#heuristicA#1
a.mtx#1#5.3#10#6%#column#cyclic#heuristicA#2
b.mtx#4#7.1#10#5%#column#cyclic#heuristicA#1
b.mtx#3#5.7#10#9%#column#cyclic#heuristicA#2
b.mtx#5#4.1#10#3%#column#cyclic#heuristicA#3
I have a scheme file to describe the contents of these files. This file has a line describing type and meaning of each column in the result files:
str MatrixName
int Speedup
double Time
int RepetationCount
double Imbalance
str Parameter1
str Parameter2
str Parameter3
int ExperimentId
I need to display average Time and two types of parameters as follows: (numbers in the following table are random)
Parameter1 Parameter2
Matrix row col cyclic block
a.mtx 4.3 5.2 4.2 5.4
b.mtx 2.1 6.3 8.4 3.3
Is there an advanced and sophisticated tool that gets the scheme of the table above and generates this table automatically? Currently I have a tool written in Java to process raw files and Latex code to manipulate and display the table using pgfplotstable. However, I need one tool that is more professional. I do not want pivot tables of MS Excel.
A similar question is here.
Manipulating large amounts of data in an unknown format is...challenging for a generic program. Your best bet is probably similar to what you're doing already. Use a custom program to reformat your results into something easier to handle (backend), and a visualisation program of your choice to let you view and play around with the data (frontend).
Backend
For your problem I'd suggest a relational database (e.g.Mysql). Has a longer setup time than other options, but if this is an ongoing problem it should be worthwhile, as it allows you to easily pull fields of interest.
SELECT AVG(Speedup) FROM results WHERE Parameter1="column" AND Parameter2="cyclic" for example. You'll then still need a simple script to insert your data in the first place, and then to pull the results of interest in a useful format you can stick into your viewer. Or if you so desire you can just run queries directly against the db.
Alternatively, what I usually use is just Python or Perl. Read in your data files, strip the data you don't want, rearrange into the desire structure, and write out to some standard format your frontend and use. Replace Python/Perl with the language of your choice.
Frontend
Personally, I almost always use Excel. The backend does most of the heavy lifting, so I get a csv file with the results I care about all nicely ordered already. Excel then lets me play around with the data, doing stuff like taking averages, plotting, reordering, etc fairly simply.
Other tools I use to display stuff which are probably not useful for you, but included for completeness include:
Weka - Mostly machine learning targetted, but provides tools for searching for trends or correlations. Useful to play around with data looking for things of interest.
Python/IDL/etc - For when I need data that can't be represented by a spreadsheet. These programs can, in addition to doing the backend's job of extracting and bulk manipulations, generate difference images, complicated graphs, or whatever else I need.
I am working on a freelance project that captures an audio file, runs some fourier analysis, and spits out three charts (x-y plots). Each chart has about ~3000 data points, which I plan to display with High Charts in the browser.
What database techniques do you recommend for storing and accessing this much data? Should I be storing the points in an array or in multiple rows? I'm considering Mongo too. Plan is to use Rails, so I was hoping to use a single database for both data and authentication.
I haven't dealt with queries accessing this much data for a single page, and this may very well be a tiny overall amount of data. In addition this is an MVP for demonstration to investors, so making it scalable to huge levels isn't of immediate concern.
My initial thought is that using Postgres and having one large table of data points, stored per-row, will be fine, and that that a bunch of doubles is not going to be too memory-intensive relative to images and such.
Realistically, I may just pull 100 evenly-spaced data points to make the chart, but the original data must still be stored.
I've done a lot of Mongo work and I can tell you what I would do if I were you.
One of the very nice properties about your data is that the x,y coordinates are of a fixed size generally. In other words it's not like you are storing comments from users, which can vary greatly in size.
With Mongo I would first make a sample document with the 3,000 points. Just a simple array of x,y points. I would see how big that document is and how my front end handled it - in other words can High Charts handle that?
I would also try to stick to the easiest conceptual model to manage, which is one document per chart, each chart having 3k points. This is a natural way to think of the data and I would start there and see if there were any performance hits. Mongo can easily store those documents, so I think the biggest pain would be in the UI with rendering the data.
Mongo would handle authentication well. I think it's a good choice for general data storage for an MVP.
I'm implementing a nonlinear SVM and I want to test my implementation on a simple not linearly separable data. Google didn't help me find what I want. Can you please advise me where I can find such data. Or at least, how can I generate such data manually ?
Thanks,
Well, SVMs are two-class classifiers--i.e., these classifiers place data on either side of a single decision boundary.
Therefore, i would suggest a data set comprised of just two classes (that's not strictly necessary because of course an SVM can separate more than two classes by passing the Classifier multiple times (in series) over the data, it's cumbersome to do this during initial testing).
So for instance, you can use the iris data set, linked to in Scott's answer; it's comprised of three classes, Class I is linear separable from Class II and III; Class II and III are not linear separable. If you want to use this data set, for convenience-sake you might prefer to remove Class I (approx. the first 50 data rows), so what remains is a two-class system, in which the two remaining classes are not linearly separable.
The iris data set is quite small (150 x 4, or 50 rows/class x four features)--depending where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger data set.
An interesting family of data sets that are comprised of just two classes and that are definitely non-linearly separable are the the anonymized data sets supplied by the mega-dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these data sets for SVM prototype evaluation because they are large data sets with quite a few features yet still comprised of just two non-linearly separable classes.
I am aware of two places from which you can retrieve this data. The first Site has a single data set (PCI Code downloads, chapter9, matchmaker.csv) comprised of 500 data points (row) and six features (columns). Although this set is simpler to work with, the data is more or less in a 'raw' form and will require some processing before you can use it.
The second source for this data, contains two eHarmony data sets, one of them is comprised of over half million rows and 59 features. In addition, these two data sets have undergone substantial processing such that the only task required before feeding them to your SVM is routine rescaling of the features.
The particular data set you need will depend highly on your choice of kernel function, so It seems the easiest method is simply creating a toy data set yourself.
Some helpful ideas:
Concentric circles
Spiral-shaped classes
Nested banana-shaped classes
If you just want a random data set which is not linearly separable, may I suggest the Iris dataset? It is a multivariate data set where at least a couple of the classes in question are not linearly separable.
Hope this helps!
You can start with simple datasets like Iris or two-moons both of which are linearly non-separable. Once you are satisfied, you can move on to bigger datasets from the UCI ML repository, classification datasets.
Be sure to compare and benchmark against standard SVM solvers like libSVM and SVM-light.
If you program in Python, you can use a few functions in the package of sklearn.datasets.samples_generator to manully generate nested moon-shape data set, concentric circle data set etc. Here is a page of plots of these data sets.
And if you don't want to generate data set manually, you can refer to this website, where in the seciton of "shape sets", you can download these data set and test on them directly.
I'm trying to see if anyone knows how to cluster some Lat/Long results, using a database, to reduce the number of results sent over the wire to the application.
There are a number of resources about how to cluster, either on the client side OR in the server (application) side .. but not in the database side :(
This is a similar question, asked by a fellow S.O. member. The solutions are server side based (ie. C# code behind).
Has anyone had any luck or experience with solving this, but in a database? Are there any database guru's out there who are after a hawt and sexy DB challenge?
please help :)
EDIT 1: Clarification - by clustering, i'm hoping to group x number of points into a single point, for an area. So, if i say cluster everything in a 1 mile / 1 km square, then all the results in that 'square' are GROUP'D into a single result (say ... the middle of the square).
EDIT 2: I'm using MS Sql 2008, but i'm open to hearing if there are other solutions in other DB's.
I'd probably use a modified* version of k-means clustering using the cartesian (e.g. WGS-84 ECF) coordinates for your points. It's easy to implement & converges quickly, and adapts to your data no matter what it looks like. Plus, you can pick k to suit your bandwidth requirements, and each cluster will have the same number of associated points (mod k).
I'd make a table of cluster centroids, and add a field to the original data table to indicate what cluster it belonged too. You'd obviously want to update the clustering periodically if your data is at all dynamic. I don't know if you could do that with a stored procedure & trigger, but perhaps.
*The "modification" would be to adjust the length of the computed centroid vectors so they'd be on the surface of the earth. Otherwise you'd end up with a bunch of points with negative altitude (when converted back to LLH).
If you're clustering on geographic location, and I can't imagine it being anything else :-), you could store the "cluster ID" in the database along with the lat/long co-ordinates.
What I mean by that is to divide the world map into (for example) a 100x100 matrix (10,000 clusters) and each co-ordinate gets assigned to one of those clusters.
Then, you can detect very close coordinates by selecting those in the same square and moderately close ones by selecting those in adjacent squares.
The size of your squares (and therefore the number of them) will be decided by how accurate you need the clustering to be. Obviously, if you only have a 2x2 matrix, you could get some clustering of co-ordinates that are a long way apart.
You will always have the edge cases such as two points close together but in different clusters (one northernmost in one cluster, the other southernmost in another) but you could adjust the cluster size OR post-process the results on the client side.
I did a similar thing for a geographic application where I wanted to ensure I could cache point sets easily. My geohashing code looks like this:
def compute_chunk(latitude, longitude)
(floor_lon(longitude) * 0x1000) | floor_lat(latitude)
end
def floor_lon(longitude)
((longitude + 180) * 10).to_i
end
def floor_lat(latitude)
((latitude + 90) * 10).to_i
end
Everything got really easy from there. I had some code for grabbing all of the chunks from a given point to a given radius that would translate into a single memcache multiget (and some code to backfill that when it was missing).
For movielandmarks.com I used the clustering code from Mike Purvis, one of the authors of Beginning Google Maps Applications with PHP and AJAX. It builds trees of clusters/points for different zoom levels using PHP and MySQL, storing it in the database so that recall is very fast. Some of it may be useful to you even if you are using a different database.
Why not testing multiple approaches?
translate the weka library in .NET CLI with IKVM.NET
add an assembly resulted from your code and weka.dll (use ilmerge) into your database
Make some tests, that is. No specific clustering works better than anyone else.
I believe you can use MSSQL's spatial data types. If they are similar to other spatial data types I know, they will store your points in a tree of rectangles, and then you can go to the lower-resolution rectangles to get implicit clusters.
If you end up wanting to explore Geohash's (which were invented at exactly the same time you posted this question), here's a more fleshed-out implementation of Geohash related functions for SQL Server's TSQL in which you might be interested.
QalGeohash-TSQL
I have used the Integer version of the Geohash extensively to cluster results to reduce data sent to a client for a limited viewport.