I am working on a freelance project that captures an audio file, runs some Fourier analysis, and spits out three charts (x-y plots). Each chart has about 3,000 data points, which I plan to display with Highcharts in the browser.
What database techniques do you recommend for storing and accessing this much data? Should I be storing the points in an array or in multiple rows? I'm considering Mongo too. The plan is to use Rails, so I was hoping to use a single database for both data and authentication.
I haven't dealt with queries accessing this much data for a single page, and this may very well be a tiny amount of data overall. In addition, this is an MVP for demonstration to investors, so making it scalable to huge levels isn't of immediate concern.
My initial thought is that using Postgres and having one large table of data points, stored per-row, will be fine, and that a bunch of doubles is not going to be too memory-intensive relative to images and such.
Realistically, I may just pull 100 evenly-spaced data points to make the chart, but the original data must still be stored.
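For the downsampling part, a minimal sketch (assuming the points are already loaded in memory and numpy is available; the function name is just illustrative):

import numpy as np

def downsample(points, target=100):
    # Pick ~target evenly spaced samples from the full list of (x, y) points;
    # the complete series stays untouched in the database.
    idx = np.linspace(0, len(points) - 1, num=target).astype(int)
    return [points[i] for i in idx]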
I've done a lot of Mongo work and I can tell you what I would do if I were you.
One of the very nice properties of your data is that the x,y coordinates are generally of a fixed size. In other words, it's not like you are storing comments from users, which can vary greatly in size.
With Mongo I would first make a sample document with the 3,000 points - just a simple array of x,y points. I would see how big that document is and how my front end handles it - in other words, can Highcharts handle that?
I would also try to stick to the easiest conceptual model to manage, which is one document per chart, each chart having 3k points. This is a natural way to think of the data and I would start there and see if there were any performance hits. Mongo can easily store those documents, so I think the biggest pain would be in the UI with rendering the data.
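As a rough illustration of that one-document-per-chart shape, here is a quick pymongo sketch for gauging document size (the database and field names are made up; in a Rails app you'd do the equivalent through Mongoid):

import bson
from pymongo import MongoClient

# One document per chart, holding all ~3,000 (x, y) points as an array.
chart = {
    "chart_type": "spectrum",   # illustrative field
    "points": [{"x": i * 0.01, "y": 0.0} for i in range(3000)],
}

# Rough BSON size check - well under Mongo's 16 MB document cap.
print(len(bson.encode(chart)), "bytes")

db = MongoClient()["fourier_mvp"]   # hypothetical database name
db.charts.insert_one(chart)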
Mongo would handle the authentication data fine as well. I think it's a good choice for general data storage for an MVP.
I am trying to understand how I can split my data into clusters using unsupervised learning, for example the k-means method.
I have 20 columns of data. How can it be projected onto a 2D surface without losing the necessary information from the other 18 columns?
What should I use to do that?
Any help will be appreciated.
If you are simply interested in viewing your data in 2 dimensions, consider using t-SNE. The scikit-learn Python package has a great implementation you can use. However, just remember that you shouldn't cluster your data on the t-SNE output, as the space your data resides in gets significantly distorted in the process (only short distances are maintained, whereas longer distances are heavily altered to be either shorter or longer).
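A minimal sketch of that workflow with scikit-learn (random data stands in for your 20 columns; the clustering runs in the original feature space and t-SNE is used only for the 2D picture):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 20)                   # stand-in for your 20-column data
X_scaled = StandardScaler().fit_transform(X)  # scale so no column dominates

# Cluster in the full 20-dimensional space...
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

# ...and use t-SNE only to draw a 2D view, coloured by cluster.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.show()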
Many times I read that data in NoSQL databases is stored denormalized. For instance, consider a chess game record. It may not only contain the IDs of the players that participate in the chess game, but also the first and last name of each player. I suppose this is done because joins are not possible in NoSQL, so if you just duplicate data you can still retrieve all the data you want in one call, without manual application-level processing of the data.
What I don't understand is that now, when you want to update a chess player's name, you have to write a query that updates both the chess-game records in which that player participates and the player record of that player. This seems like a huge performance overhead, as the database will have to search all games that the player participates in and then update each of those records.
Is it true that data is often stored denormalized like in my example?
You are correct, the data is often stored de-normalized in NoSQL databases.
The problem with the updates is partially where the term "eventual consistency" comes from.
In your example, when you update the player's name (not a common event, but it can happen), you would issue a background job to update the name across all other records. Yes, while the update is happening you may retrieve an older value, but eventually the data will be consistent. Since we're not writing ATM software here, the performance/consistency tradeoff is acceptable.
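For illustration, a sketch of such an update with pymongo (the collection and field names are assumptions about a chess schema, not anything prescribed; in production the update_many part would run as a background job):

from pymongo import MongoClient

db = MongoClient()["chess"]   # hypothetical database

def rename_player(player_id, new_first, new_last):
    # Update the canonical player record first...
    db.players.update_one(
        {"_id": player_id},
        {"$set": {"first_name": new_first, "last_name": new_last}},
    )
    # ...then fix the denormalized copies embedded in each game. Until this
    # finishes, reads may briefly return the old name (eventual consistency).
    db.games.update_many(
        {"players.player_id": player_id},
        {"$set": {"players.$[p].first_name": new_first,
                  "players.$[p].last_name": new_last}},
        array_filters=[{"p.player_id": player_id}],
    )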
You can find more info here: http://www.allbuttonspressed.com/blog/django/2010/09/JOINs-via-denormalization-for-NoSQL-coders-Part-2-Materialized-views
One way to look at it is that users change their names extremely rarely.
But board data is read and changed an immense number of times.
So it only makes sense to optimize for the case that happens vastly more often than the case that happens only rarely.
Another point to note is that by not keeping that name data duplicated under the board data, you are actually increasing the performance overhead of the read. Every time you fetch the board data, you'd have to go one step further and fetch all the user data too (even if all you really wanted was just the first and last name).
Again the reason to put that first name and last name on board data is probably that on the screen where the board data will be shown, you'll often be showing the user's name too.
For these reasons, duplicate data is expected in NoSQL DBs. (This can be done in SQL DBs too, but mind ya, you'll be frowned upon.) Duplication in the NoSQL world is fairly common and is actively promoted.
I have been working for the past 7 years with NoSQL (Firestore) for 2 fairly big projects where I was able to write code from scratch (both around 50k LoC and one has about 15k daily active users). I didn't use denormalization at all. The concept never appealed to me, and document reads are fairly cheap in Firestore.
To come back to your example: loading the other data for the chess game seems way more important than instantly being able to show the name. I would load the name based on the user id in the background and put a simple client-side memoize / cache around it to prevent fetching the same user document over and over.
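A language-agnostic illustration of that memoize-around-the-fetch idea, sketched in Python with the Firestore server SDK (the collection and field names are assumptions; in a web client you'd wrap the equivalent JavaScript call the same way):

from functools import lru_cache
from google.cloud import firestore

db = firestore.Client()

@lru_cache(maxsize=1024)
def player_name(player_id):
    # Each distinct id is fetched once; repeated lookups hit the in-memory cache.
    data = db.collection("users").document(player_id).get().to_dict()
    return f"{data['first_name']} {data['last_name']}"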
What I did use quite a bit to solve performance issues is generate derived data. I would set a listener on a database document "onWrite" and then store some computed data in another derived document. These documents would automatically update when the source changes, so it doesn't complicate things really. In the case of a chess game, a distilled document could be the leaderboard that is constantly shown to all users of the app.
Another optimization I had to do was to distill a long list of titles + metadata for recently opened "projects". Firestore on the web client side doesn't give you the ability to select fields from a document in a query; it only fetches full documents, and that was too much data for the list, so we solved this by making an API endpoint and fetching the distilled data through it.
I'm not saying you should follow my advice, but we seem to be doing well in terms of code complexity and database costs. So when I read that NoSQL requires data denormalization I become skeptical :)
That's my 2 cents.
I am working on an Android application that uses the Overpass API at [1]. My goal is to get all circular ways that enclose a certain lat-long point.
In order to do so I build a request for a rectangle that contains my location, then parse the response XML and run a ray-casting algorithm to filter the ways that enclose the given lat-long position. This is too slow for the purpose of my application because the response is sometimes tens or hundreds of MB.
Is there any OSM API that I can call to get all ways that enclose a certain location? Otherwise, how could I optimize the process?
Thanks!
[1] http://overpass-api.de/
To my knowledge, there is no standard API in OSM to do this (it is indeed a very uncommon use case).
I assume you define enclose as: the point representing the current location is inside the inner area of the polygon. Furthermore, I assume that optimizing the process might include changing the entire concept of the algorithm.
First of all, you need to define the rectangle to fetch data for. Here you need to consider that querying too large a rectangle would yield too much data. As far as I know there is no specific API to query circular ways only, and even if there were, querying too large a rectangle would probably be denied by the server, because the server load would be enormous.
Server-side precomputation / prefiltering
Therefore I suggest the first optimization: instead of querying an API that is not specifically suited to your purpose, use an offline database saved on the Android device. OsmAnd and others save the whole database for a country offline, but in your specific use case you only need to save a pre-filtered database of circular ways.
As far as I know, only a small fraction of the ways in OSM are circular. Therefore I suggest writing a script that regularly downloads OSM dumps, e.g. from Geofabrik, and removes non-circular ways (e.g. you could check whether the last node ID in a way equals the first node ID, but you'd need to verify that this captures every way you would define as circular). How often you run it depends on your use case.
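A sketch of such a pre-filter using pyosmium (the extract file name is only an example; the closed-ring test is the simple first-node-equals-last-node check mentioned above):

import osmium

class ClosedWayFilter(osmium.SimpleHandler):
    # Collect the IDs of ways whose first and last node are the same.
    def __init__(self):
        super().__init__()
        self.closed_way_ids = []

    def way(self, w):
        nodes = w.nodes
        if len(nodes) > 3 and nodes[0].ref == nodes[len(nodes) - 1].ref:
            self.closed_way_ids.append(w.id)

handler = ClosedWayFilter()
handler.apply_file("virginia-latest.osm.pbf")   # e.g. a Geofabrik extract
print(len(handler.closed_way_ids), "closed ways found")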
This optimization solves:
The issue of downloading a large amount of data
The issue of overloading the API with large requests
The issue of not being able to request large chunks of data
If that is not suitable for your usecase, I suggest to build a simple API for that on your server.
Re-chunking the data into appropriate grids
However, you would still need to filter a large amount of data. In order to partially solve this, I suggest the second optimization: re-chunk your data. For example, if your current location is in Virginia, you do not need to filter circular ways whose area lies entirely within Texas. Because filtering by state etc. would be highly country-dependent and difficult (CPU-intensive), I suggest choosing a grid of, say, 0.05 lat/lon degrees (I'd choose an equirectangular projection because it's easy to calculate if you already have lat/lon coordinates).
The script that preprocesses the data should then create one chunk of data (that could be a file, but we don't know enough about your use case to talk about specific data structures) for each rectangle in the area you want to cover. A circular way is included in a chunk if and only if it has at least one node that is inside the chunk's area.
You would then only request / filter the specific chunk your position is currently in. Choose the chunk size appropriately for your application (preferably rather small, but that depends on numerous factors!).
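A sketch of the chunk bookkeeping under those assumptions (0.05-degree cells on a plain equirectangular grid):

import math

CELL = 0.05   # grid cell size in degrees - tune for your data

def chunk_key(lat, lon):
    # Identify the grid cell a coordinate falls into.
    return (math.floor(lat / CELL), math.floor(lon / CELL))

def chunks_for_way(way_nodes):
    # A way goes into every chunk that contains at least one of its nodes.
    return {chunk_key(lat, lon) for lat, lon in way_nodes}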
This optimization solves:
Assuming most of the circular ways are quite small in terms of their bounding rectangles, you only need to filter a tiny fraction of the overall ways
IO is minimized, especially if you only load the chunk that contains your current position
Hysteretic heuristics
If the aforementioned optimizations do not sufficiently reduce your computation time, I'd suggest a third optimization, whose usefulness depends on how many circular ways you want to find (if you really need to find all of them, it won't help at all): use hysteresis. Save the circular ways you were inside of during the last computation (assuming the new current location is near the last location) and check them first. If your location didn't change too much, you have a high chance of hitting a way you're inside of during the first few raycasts.
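Sketched in Python, with contains() standing in for your existing ray-casting test (and assuming you only need the first hit):

def first_enclosing_way(point, candidate_ways, last_hits):
    # Try the ways that enclosed the previous position first (hysteresis),
    # then fall back to the remaining candidates.
    for way in last_hits + [w for w in candidate_ways if w not in last_hits]:
        if contains(way, point):
            return way
    return None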
Leveraging relations between different circular ways
Also, a fourth optimization is possible: there will be some circular ways that are fully enclosed in another circular way. You could code your program so that it knows about that relation and checks the inner circular way first. If this check succeeds, you automatically know that the current position is also contained in the outer circular way. I think computing this information (server-side) could be incredibly CPU-intensive and implementing it might also be a hard task, so I'd suggest using this optimization only if unavoidable.
Tuning the parameters of these optimizations should be sufficient to decrease the CPU time needed for your computation significantly. Please feel free to comment/ask if you have further questions regarding these suggestions.
Imagine I have a map shapefile (.shp) or OSM XML. I'm able to see different kinds of data from different layers in GIS-oriented programs, e.g. ArcGIS, QGIS, etc. But how can I get this info programmatically? Is there a specific library for that?
What I'm really looking for is some kind of method getMapData(longitude, latitude) to get landscape/terrain info (e.g. forest, river, city, highway) at a specified location.
Thanks in advance for your answers!
Whether you are better off using raster or vector data still depends on what you want to achieve.
If you are using your grid to subdivide an area as an array of containers for geographic features, then stick with vector data. To do this, I would create a polygon grid file and intersect it with each of your data layers. You can then add an ID field that represents the cell's location in the array (and hence its relative position to a known lat/long coordinate - let's say lower left). Alternatively, you can use spatial queries to access your data by selecting a polygon in your vector grid file and then finding all the features in your other file that are contained by it.
OTOH, if you want to do some multi-feature analysis based on presence/absence, then you may be better off going down the route of raster analysis. My gut feeling from what you have said is that this is what you are trying to achieve, but I am still not 100% sure. You would handle this by creating a set of boolean rasters of a suitable resolution and then performing maths operations on the set (add, subtract, average etc. - depending on what questions you are asking).
Let's say you are looking at animal migration. Let's say your model assumes that streams, hedges and towns are all obstacles to migration but roads only reduce the chance of an area being crossed. So you convert your obstacles to a value of '1' and NoData to '0' in each case, except roads where you decide to set the value to 0.5. You can then add all your rasters together in one big stack and predict migration routes.
OK, that's a simplistic example, but perhaps you can see why we need EVEN more information about what you are wanting to do.
Shapefiles or an osm xml file are just containers that hold geometric shapes. There are plenty of software libraries out there that let you read these files and extract the data. I would recommend looking at GDAL/OGR as a starting point.
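For example, reading features and their attributes from a shapefile with GDAL/OGR's Python bindings looks roughly like this (the file and field names are placeholders):

from osgeo import ogr

ds = ogr.Open("landuse.shp")            # hypothetical shapefile
layer = ds.GetLayer()
for feature in layer:
    geom = feature.GetGeometryRef()
    # Attribute fields (e.g. a land-use class) are read by name.
    print(feature.GetField("fclass"), geom.ExportToWkt()[:60])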
A method like getMapData(longitude, latitude) is essentially a search/query function. You need to be a little more specific too: do you want geometries that contain the point, that are within a distance of the point, etc.?
You could find the map data using a brute force algorithm
def find_containing_shape(shapes, query_point):
    for shape in shapes:
        if shape.contains(query_point):
            return shape
    return None
Or you can use more advanced algorithms/data structures such as R-trees, k-d trees, quadtrees, etc. The easiest way to get started with querying map data is to load it into a spatial database. I would recommend investigating PostgreSQL+PostGIS and SpatiaLite.
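A small sketch of the index-then-refine pattern using the rtree and shapely Python packages (toy polygons here; a spatial database does essentially the same thing for you behind the scenes):

from rtree import index
from shapely.geometry import Point, box

# Index a few toy polygons by their bounding boxes.
polygons = [box(0, 0, 1, 1), box(2, 2, 3, 3), box(0.5, 0.5, 2.5, 2.5)]
idx = index.Index()
for i, poly in enumerate(polygons):
    idx.insert(i, poly.bounds)

# Candidates come from the R-tree; the exact test comes from the geometry.
q = Point(0.7, 0.8)
matches = [polygons[i] for i in idx.intersection(q.bounds) if polygons[i].contains(q)]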
You may also like to look at SpatiaLite and/or PostGIS, which are two spatially enabled databases that you could use separately or in conjunction with GDAL/OGR.
I must echo Charles' request that you explain your use-case in more detail because the actual implementation will depend greatly on exactly what you are wanting to achieve. My reading of this is that you may want to convert your data into a series of aligned rasters which you can overlay and treat as a 3 dimensional array.
I'm trying to see if anyone knows how to cluster some Lat/Long results, using a database, to reduce the number of results sent over the wire to the application.
There are a number of resources about how to cluster, either on the client side OR on the server (application) side ... but not on the database side :(
This is a similar question, asked by a fellow S.O. member. The solutions are server-side (i.e. C# code-behind).
Has anyone had any luck or experience with solving this, but in a database? Are there any database gurus out there who are after a hawt and sexy DB challenge?
please help :)
EDIT 1: Clarification - by clustering, I'm hoping to group x number of points into a single point, for an area. So, if I say cluster everything in a 1 mile / 1 km square, then all the results in that 'square' are GROUP'D into a single result (say ... the middle of the square).
EDIT 2: I'm using MS SQL 2008, but I'm open to hearing if there are other solutions in other DBs.
I'd probably use a modified* version of k-means clustering using the cartesian (e.g. WGS-84 ECF) coordinates for your points. It's easy to implement & converges quickly, and adapts to your data no matter what it looks like. Plus, you can pick k to suit your bandwidth requirements, and each cluster will have the same number of associated points (mod k).
I'd make a table of cluster centroids, and add a field to the original data table to indicate which cluster it belongs to. You'd obviously want to update the clustering periodically if your data is at all dynamic. I don't know if you could do that with a stored procedure & trigger, but perhaps.
*The "modification" would be to adjust the length of the computed centroid vectors so they'd be on the surface of the earth. Otherwise you'd end up with a bunch of points with negative altitude (when converted back to LLH).
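The numeric side of that, sketched offline in Python with scikit-learn (random points stand in for your table; in practice you'd write the labels and the adjusted centroids back to SQL Server):

import numpy as np
from sklearn.cluster import KMeans

R = 6371.0   # spherical earth radius in km - fine for clustering purposes

def to_ecef(lat_deg, lon_deg):
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.column_stack([R * np.cos(lat) * np.cos(lon),
                            R * np.cos(lat) * np.sin(lon),
                            R * np.sin(lat)])

lats = np.random.uniform(-60, 60, 1000)    # stand-in points
lons = np.random.uniform(-180, 180, 1000)
X = to_ecef(lats, lons)

km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)
labels = km.labels_                        # goes into the per-row cluster column

# The "modification": push centroids back onto the sphere so they sit at zero altitude.
c = km.cluster_centers_
centroids = c * (R / np.linalg.norm(c, axis=1, keepdims=True))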
If you're clustering on geographic location, and I can't imagine it being anything else :-), you could store the "cluster ID" in the database along with the lat/long co-ordinates.
What I mean by that is to divide the world map into (for example) a 100x100 matrix (10,000 clusters) and each co-ordinate gets assigned to one of those clusters.
Then, you can detect very close coordinates by selecting those in the same square and moderately close ones by selecting those in adjacent squares.
The size of your squares (and therefore the number of them) will be decided by how accurate you need the clustering to be. Obviously, if you only have a 2x2 matrix, you could get some clustering of co-ordinates that are a long way apart.
You will always have the edge cases such as two points close together but in different clusters (one northernmost in one cluster, the other southernmost in another) but you could adjust the cluster size OR post-process the results on the client side.
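A sketch of that cell assignment and the adjacent-square lookup (a fixed 100x100 world grid; this simple version ignores wrap-around at the date line):

LAT_CELLS, LON_CELLS = 100, 100   # 100x100 world matrix

def cluster_id(lat, lon):
    row = min(int((lat + 90) / 180 * LAT_CELLS), LAT_CELLS - 1)
    col = min(int((lon + 180) / 360 * LON_CELLS), LON_CELLS - 1)
    return row * LON_CELLS + col

def nearby_cluster_ids(lat, lon):
    # The cell itself plus its 8 neighbours, for "moderately close" lookups.
    row, col = divmod(cluster_id(lat, lon), LON_CELLS)
    return [r * LON_CELLS + c
            for r in (row - 1, row, row + 1)
            for c in (col - 1, col, col + 1)
            if 0 <= r < LAT_CELLS and 0 <= c < LON_CELLS]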
I did a similar thing for a geographic application where I wanted to ensure I could cache point sets easily. My geohashing code looks like this:
def compute_chunk(latitude, longitude)
  (floor_lon(longitude) * 0x1000) | floor_lat(latitude)
end

def floor_lon(longitude)
  ((longitude + 180) * 10).to_i
end

def floor_lat(latitude)
  ((latitude + 90) * 10).to_i
end
Everything got really easy from there. I had some code for grabbing all of the chunks from a given point to a given radius that would translate into a single memcache multiget (and some code to backfill that when it was missing).
For movielandmarks.com I used the clustering code from Mike Purvis, one of the authors of Beginning Google Maps Applications with PHP and AJAX. It builds trees of clusters/points for different zoom levels using PHP and MySQL, storing it in the database so that recall is very fast. Some of it may be useful to you even if you are using a different database.
Why not test multiple approaches?
Translate the Weka library to the .NET CLI with IKVM.NET
Add an assembly built from your code plus weka.dll (use ILMerge) into your database
Then run some tests. No single clustering algorithm works better than all the others in every case.
I believe you can use MSSQL's spatial data types. If they are similar to other spatial data types I know, they will store your points in a tree of rectangles, and then you can go to the lower-resolution rectangles to get implicit clusters.
If you end up wanting to explore Geohashes (which were invented at exactly the same time you posted this question), here's a more fleshed-out implementation of Geohash-related functions for SQL Server's TSQL in which you might be interested.
QalGeohash-TSQL
I have used the Integer version of the Geohash extensively to cluster results to reduce data sent to a client for a limited viewport.