Storing images and structured data together (Google Cloud Platform) - google-app-engine

I am building a cloud service on Google Cloud Platform, but I don't have much experience using it (yet).
My service will, among other things, store structured entities with properties such as a Name, Description, etc. However, I also want each entity to be associated in some way with a collection of images, which could contain dozens or even hundreds of images.
Looking at the storage options GCP offers, the structured nature of my data suggests I should use Datastore, while the images, being 'unstructured', should go in regular Storage (probably kept in folders so that images from a particular entity stay together).
My question is: a) is this a reasonable approach for my use case?
If so, b) how do I link the two things together?
Or if not, b) what is the best way to store these?

Your approach sounds OK to me, I'd do it the same way.
As for linking the Datastore structured entity to the images, an alternate, more scalable approach than the one suggested by Andrei Volgin would be to have multiple mapping entities - one per associated image - containing as properties:
the datastore structured entity's key (or key ID)
the storage name/location of the image
The advantages of such an approach (especially when the number of images associated with one structured entity is high) are:
no 1 write/sec limitation on adding/deleting images for the same structured entity
no contention on the structured entity itself when trying to obtain image locations from multiple simultaneous requests
no performance degradation as the number of images associated with a structured entity grows (degradation that would otherwise come from the increased entity size needing to be serialized); the size of the structured entity remains small
The disadvantage is that you need an additional query to obtain the info about the images associated with a structured entity.
These mapping entities can contain additional structured image-related information, if eventually needed.
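As a rough illustration, here is a minimal sketch of such a mapping entity using the Python Datastore client; the 'EntityImage' kind and the property names are placeholders, not anything prescribed:

from google.cloud import datastore

client = datastore.Client()

def add_image(parent_key, gcs_object_name):
    # One mapping entity per image, keyed independently of the structured entity.
    mapping = datastore.Entity(key=client.key('EntityImage'))
    mapping.update({
        'parent_id': parent_key.id_or_name,  # key ID of the structured entity
        'gcs_object': gcs_object_name,       # name/location of the image in Cloud Storage
    })
    client.put(mapping)

def list_images(parent_key):
    # Fetch the Cloud Storage locations of all images for one structured entity.
    query = client.query(kind='EntityImage')
    query.add_filter('parent_id', '=', parent_key.id_or_name)
    return [e['gcs_object'] for e in query.fetch()]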

Your Datastore entities may have a property that contains a list of image file names. Assuming that you put each image in a "folder" that represents the entity ID/Name, you can display an image by simply calling (for example):
"https://storage.googleapis.com/MY_BUCKET/" + entity.getId() + "/" + IMAGE_NAME;
In several of my projects I need to store more data about each image, e.g. its order, orientation, and size. In this case I create an entity to represent each image in Datastore. In some cases I use embedded entities - for example, a product entity contains a list of embedded entities representing images related to this product. This approach allows me to display a list of products with images without an extra query to get images for each product.
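For comparison, a rough sketch of that list-of-file-names approach with the Python Datastore client (MY_BUCKET, the 'Product' kind, and the file names are placeholders):

from google.cloud import datastore

client = datastore.Client()

product = datastore.Entity(key=client.key('Product'))
product.update({
    'name': 'Sample product',
    'images': ['front.jpg', 'side.jpg'],  # file names only; the bytes live in Cloud Storage
})
client.put(product)  # the key ID is allocated here

# Build a public URL for the first image, mirroring the expression above:
url = 'https://storage.googleapis.com/MY_BUCKET/{}/{}'.format(
    product.key.id, product['images'][0])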

I would use two different kinds of entities, e.g. Album and Image, and organize them using the ancestor path, like a file structure. Then I could easily add a Comment entity kind as a child of Image.
Example of 2 entities [TaskList:default, Task:sampleTask]
use Google\Cloud\Datastore\DatastoreClient;
$datastore = new DatastoreClient();
$taskKey = $datastore->key('TaskList', 'default')
    ->pathElement('Task', 'sampleTask');
Read more about Ancestor paths

Related

Fastest AWS DB service for large media nested metadata querying

I'm trying to determine what might be the best suited (mainly in terms of speed) database service for querying metadata of media content such as images/videos/audio located on AWS S3. Currently I'm looking at DynamoDB and Redshift, but there may be better alternatives I haven't considered.
Example use case:
I have millions of images (and cropped sections of images) run through a web of machine learning models - full-image classification, bounding-box object detection, and pixel segmentation (RLE pixel labeling) - where nested labels are predicted and attributes/scores are assigned. The nested structure is continually evolving. For example, an image may be labeled by a full-image classifier with the tag "outside", then sent to an object detector that finds bounding-box locations of multiple "person" tags with x/y/width/height coordinates; these crops may then be sent to a further full (small) image classifier that labels the predicted person crops as "sitting" or "standing". I'd like to be able to speedily query the nested metadata to get the image IDs corresponding to all images with particular combinations of labels.
Specific query example:
What are the S3 locations of all images tagged with the whole-image classification label "outside", with >= two counts of the object detection label "person", and where at least one person object has been further classified as "sitting".
I've been browsing this AWS DB offering page and am not sure what is best suited to this task. Of course, if there's a far superior non-AWS/S3 solution, I'd certainly like to know that. Any suggestions are greatly appreciated!
Edit: Updated the example slightly to describe the nesting structure more clearly.
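For illustration only, the kind of nested record being described might look roughly like this; the field names are assumptions, not part of the question:

image_metadata = {
    'image_id': 'img_000123',
    's3_uri': 's3://my-bucket/images/img_000123.jpg',  # hypothetical location
    'whole_image_labels': ['outside'],                  # full-image classifier
    'detections': [                                     # bounding-box detector
        {'label': 'person', 'box': {'x': 120, 'y': 45, 'w': 80, 'h': 200},
         'crop_labels': ['sitting']},                   # crop-level classifier
        {'label': 'person', 'box': {'x': 300, 'y': 60, 'w': 75, 'h': 190},
         'crop_labels': ['standing']},
    ],
}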

Is it better to store multiple values in one attribute as an array, or create another table for it?

I'm building an app that attaches multiple pictures to a recording. Hence, there is a "Recordings" entity with attributes "Name" and "URL". I want to attach multiple images to one recording.
So do I add another attribute "images" and store an array of images? If yes, how is that possible?
Or do I create another entity that has attributes "Image" and "RecordingID", holds all images of all recordings, and connects each image to its recording via the recording ID? If yes, how do I create a unique ID for the recordings?
Please say which one is better performance-wise, and with your choice explain its associated question.
It depends on how you use the data, as always.
If you use an array of images, Core Data will load all of them as soon as you need one of them. You'll either have every image in memory at the same time, or none of them. If you'll always use all of the images at the same time, this might be OK.
If you use a separate entity, you can load individual images when you need them. It's slightly more complex but can reduce memory requirements. It also makes more sense if you decide you need more metadata for each image, for example a creation date.
In both cases you'll be better off saving the image to a file and putting just the filename in your persistent store. It's usually best to keep binary blobs like images and sounds out of Core Data. Binary is OK if you know the value will be very small, but images can potentially be quite large.
I'm not sure why you want a unique ID in the second case where you don't use one in the first case. If you need a unique ID for some reason, the UUID class is a convenient way to generate one.

Hierarchical data structure in firebase with geospatial/geohash searching; am I doing this right?

Intent:
Create a catalog of aerial photography. Records to be created by AngularJS forms, but retrieved by a map viewer like leaflet.
Goals:
Retrieve information multiple ways: by Flight, Collection, Image by ID, and Image by Geography
Data Structure:
Collection: {
  meta: '',
  Flights: {
    meta: '',
    Aerials: {
      meta: '',
      geopoint: [lat, lon],
      geopoly: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    }
  }
}
Firebase (attempt at denormalizing) Structure:
/Collections/CID
/Collections_Flights/CID
/Collections_Images/CID
/Flights/FID
/Images/IID
(I referenced this stackoverflow)
Questions:
If there are 1 Million Images does the denormalization look adequate? (each flight will have about 80 images, each collection will average 100 flights... if it matters)
Should I use a GeoHash? If so, does the GeoHash become the "Image ID (IID)" for the firebase reference /Images/IID? (Or should I make another reference, e.g. /Images_Geo/?)
Must I use a GeoHash like this example? (On most mapping servers I can pass in a bounding box of the user's current view and the server will return all the items in that location. Not sure how to go about this using Firebase.)
If Collections_Flights and Collections_Images only contain ids, and if you always retrieve them at the same time you fetch Collections/$CID, then you may not need to denormalize those indexes.
The main purpose of denormalizing here would be to make it faster to fetch a list of Collections, or to grab Collections/$CID without having to wait for image and flight lists to load. So again, if those are always used in parallel, it's probably additional complexity for no gain.
The split on Flights/ and Images/, referenced from an index, is an excellent choice.
One million images of 10k each would be about 10GB of data; an important consideration. Assuming the images aren't changing at real-time speeds, you'd probably want to optimize by finding an extremely cheap storage facility (S3, CDN, etc) for those images and just storing the URLs in Firebase, rather than storing them as data and paying real-time rates for bandwidth and storage of these bulky static assets.
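For example, an /Images/$IID node that only points at externally hosted bytes might look roughly like this (field names are assumptions):

image_node = {
    'url': 'https://cdn.example.com/aerials/IID123.jpg',  # bytes live on S3/CDN/etc.
    'flight': 'FID456',                                    # back-reference to /Flights
    'geopoint': [27.77, -82.64],
    'geohash': 'dhvnpkw',                                  # only if you go that route
}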
Whether you should use GeoHash is a huge topic and quite debatable; there are plenty of alternatives as you've already pointed out. I don't think anyone can tell you the answer to that huge implementation decision in a Q&A format (maybe as a chapter in a textbook or a discussion thread on an appropriate mailing list).

SimpleDB Select VS DynamoDB Scan

I am making a mobile iOS app. A user can create an account, and upload strings. It will be like twitter, you can follow people, have profile pictures etc. I cannot estimate the user base, but if the app takes off, the total dataset may be fairly large.
I am storing the actual objects on Amazon S3 and the keys in a database, since listing Amazon S3 keys is slow. So which would be better for storing the keys?
This is my knowledge of SimpleDB and DynamoDB:
SimpleDB:
Cheap
Performs well
Designed for small/medium datasets
Can query using select expressions
DynamoDB:
Costly
Extremely scalable
Performs great; millisecond response
Cannot query
These points are correct to my understanding: DynamoDB is more about killer speed and scalability, SimpleDB is more about querying and price (while still delivering good performance). But if you look at it this way, which will be faster: downloading ALL keys from DynamoDB, or doing a select query with SimpleDB... hard, right? One is using a blazing-fast database to download a lot (which we then have to match), and the other is using a reasonably good-performance database to query and download only the few correct objects. So, which is faster:
DynamoDB downloading everything and matching, OR SimpleDB querying and downloading just that
(NOTE: matching just means using -rangeOfString and string comparison - nothing power-consuming, time-inefficient, or server-side.)
My S3 keys will use this format for every type of object
accountUsername:typeOfObject:randomGeneratedKey
E.g. if you are referencing an account object:
Rohan:Account:shd83SHD93028rF
Or a profile picture:
Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh
I have the randomly generated key for uniqueness; it does not refer to anything, it is simply there so that keys are not repeated, which would cause two objects to overlap.
In my app, I can either choose SimpleDB or DynamoDB, so here are the two options:
Use SimpleDB: store keys in the format above, but don't use the format for any lookups; instead use attributes stored with SimpleDB. So I store the key with attributes like username, type, and maybe others I would otherwise have to include in the key format. Then, if I want to get the account object for user 'Rohan', I just use SimpleDB Select to query on the 'username' attribute and the 'type' attribute (matching 'account'). A sketch of such a select is shown after these options.
Use DynamoDB: store the keys, each in the illustrated format. I scan the whole database, returning every single key, then take advantage of the key format and use -rangeOfString to match the ones I want, and download those from S3.
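To make option 1 concrete, here is a rough sketch of such a select using the older boto 2 library (boto3 has no SimpleDB client); the domain name and attributes are assumptions:

import boto

conn = boto.connect_sdb()            # credentials come from the environment
domain = conn.get_domain('s3_keys')  # hypothetical SimpleDB domain

# Item names are the S3 keys; username/type are attributes stored alongside them.
results = domain.select(
    "select * from `s3_keys` where username = 'Rohan' and type = 'account'")
for item in results:
    print(item.name)                 # the S3 key to fetch from S3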
Also, SimpleDB is apparently geographically-distributed, how can I enable that though?
So which is quicker and more reliable? Using SimpleDB to query keys with attributes. Or using DynamoDB to store all keys, scan (download all keys) and match using e.g. -rangeOfString? Mind the fact that these are just short keys that are pointers to S3 objects.
Here is my last question, and the amount of objects in the database will vary on the decided answer, should I:
Create a separate key/object for every single object a user has
Create an account key/object and store all information inside there
There would obviously be different advantages and disadvantages between these two options. For example, retrieval would be quicker if everything is separate, but storing it all in one user's account object is more organized and keeps the dataset smaller.
So what do you think?
Thanks for the help! I have put a bounty on this, really need an answer ASAP.
Wow! What a Question :)
OK, let's discuss some aspects:
S3
S3 performance is most likely low because you're not adding a Prefix when listing keys.
If you shard by storing the objects like type/owner/id, listing all the ids for a given owner (prefixed as type/owner/) will be fast. Or at least faster than listing everything at once.
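A quick sketch of a prefix-scoped listing with boto3 (the bucket name and key layout are assumptions):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Only keys under account/Rohan/ come back, instead of the whole bucket.
for page in paginator.paginate(Bucket='my-bucket', Prefix='account/Rohan/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])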
Dynamo Versus SimpleDB
In general, that's my advice:
Use SimpleDB when:
Your entity storage isn't going to pass over 10GB
You need to apply complex queries involving multiple fields
Your queries aren't well defined
You can leverage Multi-Valued Data Types
Use DynamoDB when:
Your entity storage will pass 10GB
You want to scale demand / throughput as it goes
Your queries and model are well-defined, and unlikely to change.
Your model is dynamic, involving a loose schema
You can cache on your client-side your queries (so you can save on throughput by querying the cache prior to Dynamo)
You want to do aggregate/rollup summaries, by using Atomic Updates
Given your current description, it seems SimpleDB is actually better, since:
Your model isn't completely defined
You can defer some decision aspects, since it takes a while to hit the (10GiB) limits
Geographical SimpleDB
SimpleDB doesn't support that. It works only in us-east-1, as far as I know.
Key Naming
This applies mostly to Dynamo: whenever you can, use a Hash + Range Key. But you could also create keys using only a Hash and apply some queries, like:
List all my records in table T whose key starts with accountid:
List all my records in table T whose key starts with accountid:image
However, those are still Scans. Bear that in mind.
(See this for an overview: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Scan.html)
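With a Hash + Range Key, the second listing above can be a Query instead of a Scan - a rough boto3 sketch, with the table and attribute names assumed:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('objects')  # hypothetical table

# Hash key = account, range key = objectkey; begins_with avoids a full Scan.
response = table.query(
    KeyConditionExpression=Key('account').eq('Rohan') &
                           Key('objectkey').begins_with('ProfilePic:'))
for item in response['Items']:
    print(item['objectkey'])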
Bonus Track
If you're using Java, cloudy-data on Maven Central includes SimpleJPA with some extensions to Map Blob Fields to S3. So give it a look:
http://bitbucket.org/ingenieux/cloudy
Thank you

Methods for storing metadata associated with individual files?

Given a collection of files which will have associated metadata, what are the recommended methods for storing this metadata?
Some file formats support storing metadata internally (EXIF, ID3, etc.), but not all file formats do, so what are the more general options?
Some of the metadata would almost certainly be unique (titles/descriptions/etc), whilst some would be repetitive to varying degrees (categories/tags/etc).
It may also be useful to group the metadata, if different types of attribute are required.
Ideally, solutions should cover concepts, rather than specific language implementations.
Storing metadata in a database has some advantages, but the main problem with a database is that the metadata is not directly connected to your data. It is more robust if the metadata stays with the data - like a special file in the directory, or something like that.
Some filesystems offer special functionality that can be used for metadata - like NTFS Alternate Data Streams. Unfortunately, this can be used for metadata storage only in special cases, because those streams can easily be lost when copying the data to a storage system that does not support them. I believe Linux filesystems also have a similar mechanism (extended attributes).
Anyway, the most common solutions are:
separate hidden file(s) (per directory) that hold the metadata
a special hidden directory with metadata, as some applications use (like subversion, cvs, etc.)
a database (of various kinds) for all application-specific metadata - in most cases this database can also be used for caching purposes
IMO there is no general-purpose solution. I would choose storing the metadata in a hidden file (for robustness), with a database used for fast access and caching.
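A minimal sketch of that hidden-file option, assuming one .metadata.json per directory (the file name and layout are just one possible convention):

import json
from pathlib import Path

def write_metadata(directory, filename, metadata):
    # Keep one hidden sidecar file per directory, next to the data it describes.
    sidecar = Path(directory) / '.metadata.json'
    existing = json.loads(sidecar.read_text()) if sidecar.exists() else {}
    existing[filename] = metadata
    sidecar.write_text(json.dumps(existing, indent=2))

write_metadata('photos', 'ferrari.gif',
               {'title': 'My brand new car', 'tags': ['cars', 'cool']})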
One option might be a relational database, structured like this:
FILE
f_id
f_location
f_title
f_description
ATTRIBUTE
a_id
a_label
VALUE
v_id
v_label
METADATA
md_file
md_attribute
md_value
This implementation has some unique information (title/description),
but is primarily targeted at repetitive groups of data.
For some requirements, other less generic tables may be more useful.
The advantages of this are that relational databases are very common,
and obviously very good at handling relationships and storing lots of data.
However, for some uses a database server brings an overhead which might not be desirable.
Also, the database server is distinct from the files - they do not sit together, and require different methods of interaction.
Databases do not (easily) sit under version control - which may be a good or bad thing, depending on your point of view and specific needs.
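As a sketch only, the schema above could be created and queried in SQLite roughly like this (column types and the sample 'category'/'holiday' values are assumptions):

import sqlite3

conn = sqlite3.connect('metadata.db')
conn.executescript("""
CREATE TABLE file      (f_id INTEGER PRIMARY KEY, f_location TEXT, f_title TEXT, f_description TEXT);
CREATE TABLE attribute (a_id INTEGER PRIMARY KEY, a_label TEXT);
CREATE TABLE value     (v_id INTEGER PRIMARY KEY, v_label TEXT);
CREATE TABLE metadata  (md_file      INTEGER REFERENCES file(f_id),
                        md_attribute INTEGER REFERENCES attribute(a_id),
                        md_value     INTEGER REFERENCES value(v_id));
""")

# Example: find the locations of all files tagged with category = holiday.
rows = conn.execute("""
    SELECT f.f_location
      FROM file f
      JOIN metadata m  ON m.md_file = f.f_id
      JOIN attribute a ON a.a_id = m.md_attribute
      JOIN value v     ON v.v_id = m.md_value
     WHERE a.a_label = 'category' AND v.v_label = 'holiday'
""").fetchall()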
I think the "solution" depends greatly upon what you're going to be doing with the metadata.
For example, almost all of the metadata we store (multiple datasets of scientific data) is chopped up and stored in a database. This allows us to create datasets that preserve the metadata common to the files (as you say, categories and tags) while keeping file-specific structures (title, start/stop time, min/max values, etc.). While we could keep these in hidden files, we do a lot of searching and open our interface to outside consumers via web services.
If you're storing metadata that isn't going to be searched on, hidden files or a dedicated .xml file per "real" file isn't a bad route to take. It's readable by basically anything, can be converted to different formats easily, and won't be lost if you decide to change your storage mechanism.
Metadata should help you, not hinder you. I've seen (and been a part of) systems where metadata storage has become more burdensome than storing the actual data, and became a liability. Just keep in mind what you are trying to do with it, and don't overextend yourself with "what ifs."
Plain text has some obvious advantages over anything else. Something like
FileName = 'ferrari.gif'
Title = 'My brand new car'
Tags = 'cars', 'cool'
Related = 'michaelknight.mp3'
Picasa's Picasa.ini files are a good example for this kind of metadata. Also, instead of inventing your own format, XML might be worth considering. There are plenty of readily available DOM processors to deal with this format.
Then again, if the amount of files and relations between them is huge, databases may be better.
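A rough sketch of reading such a file back (quote and list handling kept deliberately simplistic):

def read_plain_metadata(path):
    # Parse "Key = 'value'" lines into a dict; comma-separated lines become lists.
    meta = {}
    with open(path) as fh:
        for line in fh:
            if '=' not in line:
                continue
            key, _, value = line.partition('=')
            parts = [p.strip().strip("'") for p in value.split(',')]
            meta[key.strip()] = parts[0] if len(parts) == 1 else parts
    return meta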
I would basically make a metadata DB which held this information:
RESOURCE_TABLE
RESOURCE_ID
RESOURCE_TYPE (folder, doctype, web link, other)
RESOURCE_URL (any URL)
NOTES_TABLE
NOTE_ID
RESOURCE_NO
RESOURCE_NOTE (long text)
TAGS_TABLE
TAG_ID
RESOURCE_NO
TAG_TEXT
Then I would use the note field to attach textual notes to the file/folder/resource. Choose whether you would use 1:1 or 1:N for this.
I would use the tags field to store any number of searchable parameters, like YEAR, PROJECT, and other values that describe and group your content.
Then you could add tables for owner, stakeholders, and other organisation info etc.
