How do apps like Mapbox, AllTrails and Maps.me use and display ALL of OSM's data, when all the resources say that's a huge amount of data?

I started exploring Overpass Turbo and Mapbox with hopes of building my travel app. I can query some data in OT and get towns or islands, no problem, and I understand the whole process of querying and exporting as GeoJSON.
But for learning purposes, I always do queries within a small area so I don't get too much data back.
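For example, a bbox-limited query like this is what I mean (the tags, the bounding box, and the conversion step are just what I happen to be experimenting with):

    // Node.js 18+ (global fetch, top-level await in an ES module).
    // Overpass bounding boxes are (south, west, north, east).
    const query = `
      [out:json][timeout:25];
      node["place"~"town|island"](36.0,25.0,36.5,25.5);
      out body;
    `;

    const res = await fetch('https://overpass-api.de/api/interpreter', {
      method: 'POST',
      body: query,
    });
    const data = await res.json();
    console.log(data.elements.length, 'places in the bbox');
    // From here I convert to GeoJSON, e.g. with the osmtogeojson package.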
Also, various resources mention that OSM data for the whole planet is huge. For example, https://wiki.openstreetmap.org/wiki/Downloading_data says: "The entire planet is a huge amount of data. Start with a regional extract to make sure your setup works properly. Common tools like Osmosis or various import tools for database imports and converters take hours or days to import data, depending largely on disk speed."
But when I go to apps like AllTrails, Maps.me or Mapbox, they seem to be showing a huge amount of data, certainly the main POIs.
Here's an example screenshot from AllTrails.
Can someone briefly explain how this is done? Do they actually download all of the data, or fetch it little by little depending on the current bounding box? Any info I can research further would be appreciated!
Thanks
P.S. I am hoping to build my app with Node.js, if that makes a difference.

Several reasons:
They don't always display everything. You only ever see a limited region, never the whole world in full detail. If you zoom in, you see a smaller region with more detail. If you zoom out, you see a larger region with reduced detail (fewer or no POIs, smaller roads and waterways disappear, etc.). (See the tile sketch after this list.)
They don't contain all the available data. OSM data is very diverse: OSM contains roads, buildings, landuse, addresses, POI information and much more, and each of these elements carries additional information. Roads, for instance, have maxspeed information, lane count, surface information, whether they are lit, and whether they have sidewalks or cycleways. Buildings may have information about the number of building levels, the building color, roof shape and color, and so on. Not all of this information is required for the apps you listed, so it can be removed from the data.
They perform simplifications. It isn't always necessary to show roads, buildings, waterways and landuse in full detail. Instead, simplification algorithms reduce the number of points in each geometry so that the data becomes smaller while keeping sufficient detail (a naive version is sketched after this list). This is often coupled with the zoom level, i.e. roads and lakes become less detailed when zoomed out.
They never ship the whole world offline. Depending on the app, the map is available online, offline, or both. If online, the server has to store the huge amount of data, not the client device. If offline, the map is split into smaller regions that can be handled by the client. This usually means that a given map covers a state, a region or sometimes a city, but rarely a whole country (except for smaller countries). If you want to store whole countries offline, you will need a significant amount of storage.
They never access OSM directly. Apps and websites that display OSM maps don't obtain this information live from OSM. Either they maintain their own database containing the required data, periodically updated from the main OSM database via planet dumps, or they use a third-party map provider (such as Mapbox in your screenshot) to display a base map with their own layers on top. In the latter case they don't have to store much on their own servers, just the things they want to show on top of OSM.
None of the above is specific to OSM. You will find similar mechanisms in other map apps and with other map sources.
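To make the first and third points concrete, here is a minimal sketch in Node.js (since you mentioned it): the standard web-Mercator "slippy map" tile formula that determines which small piece of the world is fetched at a given zoom, plus a deliberately naive radial-distance simplification standing in for the real algorithms (such as Douglas-Peucker) that tile producers use:

    // Which tile covers a given lon/lat at zoom z (standard web-Mercator tile math)?
    function lonLatToTile(lon, lat, z) {
      const n = 2 ** z;
      const latRad = (lat * Math.PI) / 180;
      const x = Math.floor(((lon + 180) / 360) * n);
      const y = Math.floor(
        ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n
      );
      return { z, x, y };
    }

    console.log(lonLatToTile(13.41, 52.52, 12)); // Berlin at z12 -> { z: 12, x: 2200, y: 1343 }

    // Naive radial-distance simplification: drop points closer than `tolerance`
    // to the last kept point. Real renderers use smarter algorithms per zoom level
    // and always retain the final point as well.
    function simplify(points, tolerance) {
      const kept = [points[0]];
      for (const p of points.slice(1)) {
        const last = kept[kept.length - 1];
        if (Math.hypot(p[0] - last[0], p[1] - last[1]) >= tolerance) kept.push(p);
      }
      return kept;
    }

    console.log(simplify([[0, 0], [0.4, 0.1], [1, 0], [2, 0]], 0.5));
    // -> [ [0, 0], [1, 0], [2, 0] ]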

Related

What is the best way to represent data lineage in an image processing pipeline?

I am trying to determine the best way to represent data lineage for image processing. I have images stored in S3, and I want to process them and then place them back in S3. I would then want to be able to run a query so I can see all the images and processes before and after in a chain. For example:
Image1 -ProcessA-> Image2 -ProcessB-> Image3
I would expect a search for the "lineage" of Image2 would yield the above information.
I know this looks like a cookie-cutter case for a graph database, but I am not super familiar with them, especially for a production workflow. I have been fighting with how to implement this model in a relational database, but I feel like I am just trying to put a square peg in a round hole.
Is a graph DB the only option? Which flavor would you suggest?
Is there a way to make this work in a relational model that I have not considered?
You are correct when you say this is a cookie-cutter case for a graph database, and any of the available graph database products will likely be able to meet your requirements. You can also solve this problem using a relational database but, as you indicated, it would be like putting a square peg in a round hole.
Disclosure: I work for Objectivity, maker of the InfiniteGraph product.
I have solved similar data lineage problems using InfiniteGraph. The basic idea is to separate your data from your metadata. The "lineage" information is metadata; let's put that in the graph database. The lineage information will include objects (nodes) that contain the metadata for images, and the workflow process steps that consume images as input and generate images or other information as output.
We might define an ImageMD type in InfiniteGraph to contain the metadata for an image, including a URI that defines where the image data is currently stored, and the size and format of the image. We might define a ProcessMD type to describe an application that operates on images. Its attributes might include the name and version of the application, as well as its deployment timestamp and the host location where it is running.
You are going to end up with an environment that looks something like the following diagram.
Then, given an image, you can track its lineage backward to see its history, and forward to see how it or its derivative components evolved or were used.
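To illustrate the idea (this is plain JavaScript, not InfiniteGraph's actual API; the node IDs and attributes are made up), the metadata graph and the two-direction traversal might look like this:

    // Nodes are ImageMD / ProcessMD records; directed edges link inputs to outputs.
    const nodes = new Map(); // id -> { id, type, ...attributes }
    const edges = [];        // { from, to }

    function addNode(id, type, attrs = {}) { nodes.set(id, { id, type, ...attrs }); }
    function addEdge(from, to) { edges.push({ from, to }); }

    // Image1 -ProcessA-> Image2 -ProcessB-> Image3
    addNode('Image1', 'ImageMD', { uri: 's3://bucket/image1.png' });
    addNode('ProcessA', 'ProcessMD', { name: 'resize', version: '1.0' });
    addNode('Image2', 'ImageMD', { uri: 's3://bucket/image2.png' });
    addNode('ProcessB', 'ProcessMD', { name: 'sharpen', version: '2.1' });
    addNode('Image3', 'ImageMD', { uri: 's3://bucket/image3.png' });
    addEdge('Image1', 'ProcessA');
    addEdge('ProcessA', 'Image2');
    addEdge('Image2', 'ProcessB');
    addEdge('ProcessB', 'Image3');

    // Walk the graph in either direction to answer "what is the lineage of Image2?"
    function lineage(id, direction) {
      const next = direction === 'back'
        ? (n) => edges.filter((e) => e.to === n).map((e) => e.from)
        : (n) => edges.filter((e) => e.from === n).map((e) => e.to);
      const seen = new Set();
      const stack = [id];
      while (stack.length) {
        const n = stack.pop();
        if (seen.has(n)) continue;
        seen.add(n);
        stack.push(...next(n));
      }
      seen.delete(id);
      return [...seen];
    }

    console.log(lineage('Image2', 'back'));    // [ 'ProcessA', 'Image1' ]
    console.log(lineage('Image2', 'forward')); // [ 'ProcessB', 'Image3' ]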
This is the basis for the Objectivity, Inc. application Metadata Connect.

What is the best approach to reduce the Redux state size?

I have a new mobile app project using React Native.
I am thinking about using Redux to manage all of the data coming from our remote server API. Our product has a lot of business data that needs to be displayed in the mobile app.
So my question is: if the Redux store holds all of our business data, it will take up a lot of memory on the mobile device (for example, the data backing a ListView component). How can I reduce the memory usage?
Based on your description of what you're trying to do, I am choosing to address the underlying concern about the size of your Redux store generally and the approach of storing everything on the client, and will not address how to actually reduce the size of your data store (the only answer to that is simply "don't store so much").
This is just a total SWAG (a wild guess) and ignores things like compression, data duplication, the difference between storing something in AsyncStorage versus simply holding it in memory, etc.
That having been said, if you need some sort of gut check on whether memory/storage will be an issue, take a representative chunk of record data served by your API, serialize it as a JSON string, and figure out how big it is.
For example, this sample Twitter response is roughly 8.5 KB with whitespace removed. Let's say 10 KB per individual record for simplicity.
Now, how many records do you plan on bringing down? 10? 100? 1000? Let's say 1000 records of this type. That's 10,000KB or roughly 10MB.
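As a sketch of that gut check in Node.js (the record shape and count are placeholders for whatever your API actually returns):

    // Rough sizing check; run this in Node, not in the app itself.
    const sampleRecord = { /* paste one representative record from your API */ };
    const bytesPerRecord = Buffer.byteLength(JSON.stringify(sampleRecord), 'utf8');
    const expectedRecords = 1000; // however many you plan to bring down
    const totalMB = (bytesPerRecord * expectedRecords) / (1024 * 1024);
    console.log(`~${bytesPerRecord} bytes per record, ~${totalMB.toFixed(1)} MB total`);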
With the constructs here, 10 MB may or may not be a trivial amount of memory/storage to use in your application, depending on the specific constraint you're concerned about.
You need to apply this same process to your particular use case, and see whether the amount of data you wish to store will be a problem for the devices you have to support.
A more relevant thing to consider is the performance impact of churning through large quantities of data on a single thread, to do things like data manipulation and joining/merging, if that will be a need.
Redux is a tiny library that doesn't actually do that much for you by itself. This consideration is a general one, and is totally unique to your own application and cannot be concretely answered.

Shopify - data model

One of my customers requested some changes to her Shopify site. She sells pictures, and she would like to start offering frames.
But the whole administration of the frames will be complicated enough that I already know I will need to extend the data model somehow, because I will need to store some additional relations.
So my question is: is it somehow possible to store arbitrary data through the Shopify API, like creating new entities with custom attributes, etc.? I was searching through the API documentation but I was not able to find any solution.
Second question: would it be possible to solve this problem with an embedded app? That is, I would develop the whole administration part as a small application and then embed it into Shopify. Would it then be possible to join data from Shopify's storage and my database through the Shopify API?
Is there some example for this scenario?
Thank you for your help.
I built a Shopify store selling frames when Shopify first came out. That shop still sells many, many frames. There was not even a Shopify API back then; still, it was possible to do it with 100% client-side code. You simply price each frame in units, for example $1 per inch. Client-side, you collect the frame size as length x width, come up with the total inches needed, and there is your quantity. Use mm, cm, whatever units work for you. You can even get fancy, if you're good, and work in mats, backing, and types of glass, all with one add-to-cart click.
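A sketch of the client-side arithmetic (the variant ID and the /cart/add.js call are only there to show where the quantity goes; adjust to your own setup):

    // Price the frame product at $1 per inch, then buy "inches" as the quantity.
    function frameQuantity(widthInches, heightInches) {
      // Total moulding needed is the frame's perimeter.
      return Math.ceil(2 * (widthInches + heightInches));
    }

    const quantity = frameQuantity(16, 20); // a 16x20 frame -> 72 inches
    console.log(quantity); // 72
    // Then add 72 units of the $1/inch frame product to the cart,
    // e.g. POST /cart/add.js with { id: frameVariantId, quantity }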

Image format for large storage in relation to the nature of the storage system

Now, I have read these questions, which may be related to this one: Scalable Image Storage, Large scale image storage, https://serverfault.com/q/95444.
Here is what I have found out, before I ask my question:
1. Facebook uses Haystack (closed-source to the open-source world), which is very efficient. It is a form of file-system storage, engineered for speed and large-scale metadata management.
2. Every operating system has a limit on files per directory and may start to perform extremely poorly when this limit is exceeded.
3. Most NoSQL developers have found it easy to use CouchDB / Couchbase Server to handle images, since an image is handled as an attachment glued to a document (a record in the database). However, this is still file-system storage.
4. HDFS, NFS and ZFS are all file systems that may make it easy to handle large distributed data. However, for applications like Facebook, they were not enough.
5. Proper caching is essential for heavily image-dependent applications.
6. Some (mostly PHP) developers have used MySQL to keep image metadata while creating folders and sub-folders (matching the meta info) on the file system. Each image gets a random hash name tied to its metadata in the database, to enable fast location on the file system (see the path sketch below).
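A sketch of finding 6 in Node.js (the directory layout and extension are illustrative): the database keeps the metadata, while the file lands at a path derived from a hash, fanned out across sub-folders so no single directory grows huge (see finding 2):

    const crypto = require('crypto');
    const path = require('path');

    function imagePath(root, imageId) {
      const hash = crypto.createHash('sha1').update(String(imageId)).digest('hex');
      // Two levels of fan-out: 256 x 256 sub-folders.
      return path.join(root, hash.slice(0, 2), hash.slice(2, 4), hash + '.jpg');
    }

    console.log(imagePath('/var/images', 12345));
    // -> /var/images/8c/b2/8cb2237d0679ca88db6464eac60da96345513964.jpg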
After understanding these statements and many others, I have come to realize that it is very expensive to keep a constantly growing number of billions of images on the file system. If one were to use cloud storage like Amazon S3, the high image traffic as well as the storage costs could kill the business. I have evaluated using Couchbase Server and managing images as attachments. However, for an application with a growing image collection this is also file-system storage, and I wonder how Couchbase would behave if hundreds or thousands of people accessed images at the same time. I could use Cloudant/BigCouch, which has auto-sharding/load balancing. The main point remains that the NoSQL solution would still be keeping images on the file system, and when images are requested at a high concurrent rate, this might bring the whole service down (images can be heavy).

My thinking: I am thinking of managing my images in SVG format, because I think I can treat SVG data as text in my storage. Now, most NoSQL databases have a limit on document (record) size, at most around 4 MB (not sure). This presents a problem, because an SVG file can reach 6-10 MB depending on the image, so I think I cannot use Couchbase Server for SVG storage. Besides, the nature of the application is such that the image data keeps growing and is never archived or removed, and Couchbase is not good for such data (highly persistent and unchanging data). This brings me back to RDBMSs (especially Oracle), which are known for good text compression. If I take the SVG data plus its metadata and store it as a BLOB in an Oracle database, I have a feeling this could work. I have heard that an Oracle table can even grow to terabytes, probably with partitioning or some kind of fragmentation. But the whole point is that for an Oracle table to reach 20 GB containing text, I think this would be a lot of data.

Now, my questions arise from all the above findings:
1. Why do developers keep choosing file-system storage of images as opposed to SVG, which in my (probably naive) thinking can be handled as text, and hence compressed, encrypted, digested, split, easily stored, etc.?
2. What complexities are there when an application works with images entirely as SVG, serving SVG to browsers instead of actual image files?
3. Which is technically more memory-taxing for a web server: serving images read from the file system (.png, .jpg, .gif), or serving images as SVG (probably from a database, or from a middle tier), especially under heavy load, for example at Facebook's scale?
4. SVG seems not to lose quality when rendered at different zooms or resolutions, so why haven't developers worked with SVG much in image-heavy applications? I mean, is there any known loss of quality in converting from PNG, JPG or GIF to SVG?
5. Is my view of using an RDBMS like Oracle or MySQL Cluster very naive, for storing highly persistent metadata as well as the persistent SVG data?
Please weigh in and give your suggestions about large image storage formats. Thanks.
EDIT / UPDATE
There are tools like ImageMagick which offer command-line options for manipulating images. The most important question I need answered is probably this: is Couchbase Server (whether a single server or version 2.0) capable of serving images at "user-experience acceptable" performance, or at "social network" scale?
On databases
What is a file but data, and what is a file system but a database? Records in databases, files on file systems, keys and values in your KV stores: those are all fruits of the same tree.
Plain file systems were developed over decades to serve the purpose of delivering files locally; on top of that you can build a distribution model.
Things like HDFS include distribution as part of the file system itself, but impose unnecessary overhead when you try to work with files locally.
Things like relational databases or KV stores might help you lay out your diagrams or painlessly store more bits of metadata, but unless they were specifically designed to work as file storage systems, they are going to fail at it.
Picking a storage system is all about tradeoffs, and it's up to you to figure out the best solution to your problem. Chances are your problems are not even close to Facebook's problems; a few servers with a CDN on top of them and you are going to be fine.
On file format
SVGs won't work for regular pictures, don't even dream about it.
At large scale you want to do the minimum amount of transformation when you accept files: rescale/compress/crop the image if it doesn't fit your requirements, then store it. Unless you're doing some magic on those images, you don't want to convert them into different formats or compress them without a real need for it.
At large scale you want your file to be (ordered by priority; a sketch of the first two follows the list):
served from client's cache
served from OS cache / memory
served from file system directly
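A minimal Node.js sketch of the first two priorities: hash-named files never change, so the client may cache them for a year, and streaming straight from disk lets the OS page cache keep hot files in memory (the paths and port are illustrative):

    const http = require('http');
    const fs = require('fs');
    const path = require('path');

    const root = '/var/images';

    http.createServer((req, res) => {
      // Flat layout keyed by file name; deliberately ignores sub-folders and query strings.
      const file = path.join(root, path.basename(req.url));
      fs.stat(file, (err, stat) => {
        if (err || !stat.isFile()) { res.writeHead(404); return res.end(); }
        res.writeHead(200, {
          'Content-Length': stat.size,
          // Immutable, hash-named content can be cached client-side "forever".
          'Cache-Control': 'public, max-age=31536000, immutable',
        });
        fs.createReadStream(file).pipe(res); // repeated disk reads hit the OS page cache
      });
    }).listen(8080);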
First, I want to mention that your understanding of image file formats may be naive, since you don't provide a lot of details. How do you intend to store (for example) PNG images "as SVG format"?
I can't answer all of your questions, but I'll make the attempt.
"file system or SVG" is a false dichotomy, it's easily possible to store JPG blobs in a database, or SVG files on file-system storage. You can handle any of the bitmap image formats as text too. If you want an example, try opening up a PostScript file with embedded bitmap data. Your question of "why not" implies that the two are interchangeable, and they're typically not. As an example, my company has evaluated a bunch of different file formats for document storage, and we've gone with PDF (shudder) and PS, depending on the situation. We didn't go with SVG for two reasons; firstly while multi-page documents are in the official standard, SVG editors and viewers seem to have choppy support for them. Secondly, SVG presents some complications when being printed in an automated fashion (to demonstrate, try this experiment: whip up an SVG file and an equivalent PostScript file, then try to print both using lp).
I mentioned two complications already (though if you're dealing with a web app, neither should bite you, since your clients will presumably be using the browsers' rendering engines, and you may not need more than one page). The only other one is browser support, which is, as always, choppy in older editions of IE. You also have to be aware of the font situation: either make sure any fancy typography is treated as a path, or only use fonts that you know viewers will have access to (for web apps, CSS3 helps a bit there).
SVGs and other vector/procedural representations tend to be smaller, so I'm inclined to say they'll be easier for a server to handle. This isn't based on any testing, so take it with a grain of salt. Keep in mind that they do tend to consume more resources over at the client end, but that shouldn't be a very big deal in a web situation.
If your image can be expressed as an SVG, yes, very good idea. However, converting arbitrary bitmaps to vector representations is AFAIK an open problem. Some things don't convert well, even manually, and some things are actually larger when expressed as SVGs than as JPGs. For things like business documents, flowcharts or typography, vectors are strictly better (barring the font problem I mention above). Certain types of illustrations do better as vectors, and some do better as rasters. Finally, if you're starting out with a bitmap (say, a photograph), converting it to SVG will either noticeably drop quality, or take a lot of manual time (if it can be done well at all).
This is the one I can't really answer, since I've never built anything to the scale you seem to be aiming at.
I'd suggest storing your images in S3 -- don't worry about rolling your own until the economics force you to. It's much better to worry about things your users care about, than how your blobs are stored.
As far as Couchbase goes (I'm a cofounder), we see people using it in similar use cases: typically for metadata and image tracking (who owns it, timestamps, tags, basically anything you want to store or query on). The Couchbase record would then just contain a URL to the actual image stored on S3.
"SVGs won't work for regular pictures, don't even dream about it."
"However, converting arbitrary bitmaps to vector representations is AFAIK an open problem. Some things don't convert well, even manually, and some things are actually larger when expressed as SVGs than as JPGs."
I think both these statements are wrong.
https://sites.google.com/site/jcdsvg/svg_paradoxes.svg
See examples three and four. The cat image is saved as a medium-resolution PNG file, which allows the image to stay high resolution when zoomed. It has a larger file size than a regular web image, but that is on purpose.
Storing bit-mapped images as SVG is as simple as putting them in an SVG container.
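For example, wrapping an existing PNG in an SVG container from Node.js (the file names and dimensions are made up; note the pixels stay raster, and base64 inflates the size by about a third):

    const fs = require('fs');

    const png = fs.readFileSync('cat.png').toString('base64');
    const svg = `<svg xmlns="http://www.w3.org/2000/svg"
      xmlns:xlink="http://www.w3.org/1999/xlink" width="800" height="600">
      <image width="800" height="600"
             xlink:href="data:image/png;base64,${png}"/>
    </svg>`;
    fs.writeFileSync('cat.svg', svg);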

Store location information, or use a third party source?

I'm working on a location-based web app (for learning purposes) where users will rate local businesses. I then want users to be able to see local businesses based on where they live and a given range (i.e., businesses within 10 miles of 123 Street, City, ST 12345).
I'm wondering what I should use for location info: a third-party source (like Google's geocoding API), or hosting my own location database? I know of zip-code databases that come with the lat/lon of each zip code along with other data, but these databases are often incomplete, and definitely not global.
I know that most APIs set usage limits, which may be a deciding factor. I suppose I could store all the data retrieved from Google in a database of my own, so that I never make the same query twice.
What do you guys suggest? I've tried looking at existing answers on SO, but nothing helped.
EDIT: To be clear, I need a way to find all businesses that fall within a certain range of a given location. Is there a service I could use to do this (i.e., return all cities, zips, etc. that fall within range of a given location)?
Storing the data you retrieve in a local cache is always a good idea. It will reduce lag and keep from taxing whatever API you are using. It can also help keep you under usage limits as you stated. You can always place size limits on that cache and clear it out as it ages if the need arises.
Using an API means that you'll only be pulling data for the sites you need information on, versus buying a bunch of data and having to load/host it all yourself (these data sets tend to get huge). I suggest using an API plus caching (see the range-filter sketch below).
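Once the lat/lon for each business is sitting in your own database or cache, the range filter itself is simple. A sketch using the haversine great-circle distance (the shape of the business list is assumed):

    // Haversine distance between two lat/lon points, in miles.
    function distanceMiles(lat1, lon1, lat2, lon2) {
      const R = 3959; // mean Earth radius in miles
      const toRad = (d) => (d * Math.PI) / 180;
      const dLat = toRad(lat2 - lat1);
      const dLon = toRad(lon2 - lon1);
      const a = Math.sin(dLat / 2) ** 2 +
                Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
      return 2 * R * Math.asin(Math.sqrt(a));
    }

    // businesses is assumed to look like [{ name, lat, lon }, ...] from your cache.
    const within = (businesses, lat, lon, rangeMiles) =>
      businesses.filter((b) => distanceMiles(lat, lon, b.lat, b.lon) <= rangeMiles);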
