I'm trying to determine which database service would be best suited (mainly in terms of speed) for querying metadata of media content such as images/videos/audio stored on AWS S3. Currently I'm looking at DynamoDB and Redshift, but there may be better alternatives I haven't considered.
Example use case:
I have millions of images (and cropped sections of images) that have been run through a web of machine learning models: full-image classification, bounding-box object detection, and pixel segmentation (RLE pixel labels). These models predict nested labels and assign attributes/scores, and the nested structure is continually evolving. For example, an image may be classified by a full-image classifier and given the tag "outside", then sent to an object detector that detects bounding-box locations of multiple "person" tags with x/y/width/height coordinates, and these crops may then be sent to a further full-image (small) classifier that labels each predicted person crop as "sitting" or "standing". I'd like to be able to quickly query the nested metadata to get the image IDs corresponding to all images with particular combinations of labels.
Specific query example:
What are the S3 locations of all images tagged with the whole-image classification label "outside", with >= two counts of the object detection label "person", and where at least one person object has been further classified as "sitting".
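To make the nesting concrete, here is a rough sketch in plain Python of what one image's metadata record might look like and how the query above reads against it (field names like labels, objects, and s3_uri are just illustrative, not an existing schema):

    # Hypothetical nested metadata record for one image (field names are illustrative).
    image_record = {
        "image_id": "img-000123",
        "s3_uri": "s3://my-bucket/images/img-000123.jpg",  # assumed bucket/key layout
        "labels": ["outside"],                              # whole-image classification
        "objects": [                                        # bounding-box detections
            {"label": "person", "box": {"x": 10, "y": 20, "w": 50, "h": 120},
             "attributes": {"pose": "sitting"}},            # nested crop classification
            {"label": "person", "box": {"x": 200, "y": 35, "w": 45, "h": 110},
             "attributes": {"pose": "standing"}},
        ],
    }

    def matches_example_query(rec):
        """'outside' image, >= 2 'person' detections, at least one classified 'sitting'."""
        people = [o for o in rec["objects"] if o["label"] == "person"]
        return (
            "outside" in rec["labels"]
            and len(people) >= 2
            and any(o["attributes"].get("pose") == "sitting" for o in people)
        )

    # s3_uris = [r["s3_uri"] for r in all_records if matches_example_query(r)]

The question is which service lets me run this kind of nested filter quickly at the scale of millions of records.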
I've been browsing this AWS DB offering page and am not sure what is best suited to this task. Of course, if there's a far superior non-AWS/S3 solution, I'd certainly like to know that. Any suggestions are greatly appreciated!
Edit: Updated the example slightly to describe the nesting structure more clearly.
We are trying to implement a new log system for our IoT devices and different applications in the cloud (API, SPA, etc.). We are trying to design the "schema" to be as efficient as possible; we feel there are many good solutions, but it's hard to select one.
Here is the general structure: under the devices node we have our 3 different kinds of IoT devices, and similarly for infra: different applications and more.
So we were thinking of creating one index for each blue circle and using hierarchical naming for our indexes, so we can take advantage of wildcards when executing searches (a quick sketch follows the example index names below).
For example:
logs-devices-modules
logs-devices-edges
logs-devices-modules
logs-infra-api
logs-infra-portal
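As a rough illustration of what we mean by the wildcard (Python against the standard Elasticsearch _search REST endpoint; the query fields are only hypothetical):

    import requests

    ES_URL = "http://localhost:9200"  # assumed cluster address

    # Search across every device index at once via the wildcard pattern.
    query = {
        "query": {
            "bool": {
                "must": [
                    {"match": {"message": "connection timeout"}},   # hypothetical field
                    {"range": {"@timestamp": {"gte": "now-24h"}}},
                ]
            }
        }
    }

    resp = requests.post(f"{ES_URL}/logs-devices-*/_search", json=query)
    print(resp.json()["hits"]["total"])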
And for the mapping: we have different log types in each index. Should we map only the common fields, or everything? Or should we map the common fields and leave dynamic mapping to handle the log-type-specific ones?
Please share your opinion and any tips you have!
Thanks.
I would generally map everything to ECS (the Elastic Common Schema), since Kibana knows the meaning of many of its fields and it aligns with other inputs.
How much data do you have, and how different are your fields? If you don't have too much data (each shard should ideally hold >10GB; manage this with rollover / ILM), and you have fewer than 100 fields in total, I would go for a single index and add a field with the different source names, so you can easily filter on that. Though different retention lengths for the data would favor multiple indices, so you will have to pick the right tradeoffs for your system.
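As a minimal sketch of the "map the common fields, let dynamic mapping handle the rest" option, assuming the logs-* naming above and a composable index template (field names are illustrative and only loosely ECS-like):

    import requests

    ES_URL = "http://localhost:9200"  # assumed cluster address

    # Index template: explicitly map the fields shared by all log types,
    # and let dynamic mapping pick up the type-specific extras.
    template = {
        "index_patterns": ["logs-*"],
        "template": {
            "mappings": {
                "dynamic": True,  # type-specific fields get mapped on the fly
                "properties": {
                    "@timestamp": {"type": "date"},
                    "message": {"type": "text"},
                    "log": {"properties": {"level": {"type": "keyword"}}},
                    "service": {"properties": {"name": {"type": "keyword"}}},  # filter by source
                },
            }
        },
    }

    resp = requests.put(f"{ES_URL}/_index_template/logs-common", json=template)
    print(resp.status_code, resp.json())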
I am building a cloud service with Google Cloud Platform, but I don't have much experience using it (yet).
My service will, among other things, store structured entities with properties such as a Name, Description etc. However, I also want each entity to be associated in some way with a collection of images which could have dozens or even hundreds of images.
Looking at the storage options GCP offers, the structured nature of my data suggests I should use Datastore, while the images, being 'unstructured', should use regular Cloud Storage (probably stored in folders to keep images from a particular entity together).
My question is a) is this a reasonable approach for my use case?
if so b) How do I link the two things together?
or if not b) What is the best way to store these?
Your approach sounds OK to me, I'd do it the same way.
As for linking the Datastore structured entity to the images, an alternative, more scalable approach than the one suggested by Andrei Volgin would be to have multiple mapping entities, one per associated image, containing as properties:
the datastore structured entity's key (or key ID)
the storage name/location of the image
The advantages of such an approach (especially when the number of images associated with one structured entity is high) are:
no 1 write/sec limitation on adding/deleting images for the same structured entity
no contention on the structured entity itself when trying to obtain image locations from multiple simultaneous requests
no performance degradation with the increase of the number of images associated to a structured entity (due to the increased entity size needed to be serialized); the size of the structured entity remains small
The disadvantage is that you need an additional query to obtain the info about the images associated with a structured entity.
These mapping entities can contain additional structured image-related information, if eventually needed.
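A minimal sketch of the mapping-entity idea, using the Python google-cloud-datastore client (the ImageMapping kind and property names are just placeholders):

    from google.cloud import datastore

    client = datastore.Client()

    def add_image_mapping(parent_key, gcs_path):
        """One small mapping entity per image, pointing back at the structured entity."""
        mapping = datastore.Entity(key=client.key("ImageMapping"))
        mapping.update({
            "parent_key": parent_key,  # key of the structured entity
            "gcs_path": gcs_path,      # e.g. "my-bucket/<entity-id>/photo-001.jpg"
        })
        client.put(mapping)

    def list_image_paths(parent_key):
        """Extra query, but no contention on the structured entity itself."""
        query = client.query(kind="ImageMapping")
        query.add_filter("parent_key", "=", parent_key)
        return [m["gcs_path"] for m in query.fetch()]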
Your Datastore entities may have a property that contains a list of image file names. Assuming that you put each image in a "folder" that represents the entity ID/Name, you can display an image by simply calling (for example):
"https://storage.googleapis.com/MY_BUCKET/" + entity.getId() + "/" + IMAGE_NAME;
In several of my projects I need to store more data about each image, e.g. its order, orientation, and size. In this case I create an entity to represent each image in Datastore. In some cases I use embedded entities - for example, a product entity contains a list of embedded entities representing images related to this product. This approach allows me to display a list of products with images without an extra query to get images for each product.
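A minimal sketch of that embedded-entity approach with the Python google-cloud-datastore client (the Product kind and image fields are illustrative):

    from google.cloud import datastore

    client = datastore.Client()

    product = datastore.Entity(key=client.key("Product"))
    product["name"] = "Example product"

    # Embedded entities: each image's metadata lives inside the product entity,
    # so listing products with their images needs no extra query.
    images = []
    for file_name, order in [("front.jpg", 1), ("side.jpg", 2)]:
        img = datastore.Entity()
        img.update({"file_name": file_name, "order": order, "orientation": "landscape"})
        images.append(img)

    product["images"] = images
    client.put(product)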
I would use two different kinds of entities, e.g. Album and Images, and organize them using the ancestor path, like a file structure. Then I could easily add a Comment entity kind as a child of Images.
Example of 2 entities [TaskList:default, Task:sampleTask]
$taskKey = $datastore->key('TaskList', 'default')
->pathElement('Task', 'sampleTask');
Read more about Ancestor paths
Intent:
Create a catalog of aerial photography. Records are to be created by AngularJS forms, but retrieved by a map viewer like Leaflet.
Goals:
Retrieve information multiple ways: by Flight, Collection, Image by ID, and Image by Geography
Data Structure:
Collection: {
    meta: '',
    Flights: {
        meta: '',
        Aerials: {
            meta: '',
            geopoint: [lat, lon],
            geopoly: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
        }
    }
}
Firebase (attempt at denormalizing) Structure:
/Collections/CID
/Collections_Flights/CID
/Collections_Images/CID
/Flights/FID
/Images/IID
(I referenced this stackoverflow)
Questions:
If there are 1 million images, does the denormalization look adequate? (Each flight will have about 80 images and each collection will average 100 flights, if it matters.)
Should I use a GeoHash? If so, does the GeoHash become the "Image ID (IID)" for the Firebase reference /Images/IID? (Or should I make another reference, e.g. /Images_Geo/?)
Must I use a GeoHash like this example? (On most mapping servers I can pass in a bounding box of the user's current view and the server will return all the items in that location. Not sure how to go about this using Firebase.)
If Collections_Flights and Collections_Images only contain IDs, and if you always retrieve them at the same time you fetch Collections/$CID, then you may not need to denormalize those indexes.
The main purpose of denormalizing here would be to make it faster to fetch a list of Collections, or to grab Collections/$CID without having to wait for image and flight lists to load. So again, if those are always used in parallel, it's probably additional complexity for no gain.
The split on Flights/ and Images/, referenced from an index, is an excellent choice.
One million images of 10k each would be about 10GB of data; an important consideration. Assuming the images aren't changing at real-time speeds, you'd probably want to optimize by finding an extremely cheap storage facility (S3, CDN, etc) for those images and just storing the URLs in Firebase, rather than storing them as data and paying real-time rates for bandwidth and storage of these bulky static assets.
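A small sketch of that idea with the Python firebase_admin SDK (the project, CDN URL, and fields are hypothetical): each /Images/IID record holds metadata plus a URL pointing at cheap static storage rather than the image bytes:

    import firebase_admin
    from firebase_admin import credentials, db

    # Assumed service-account credentials and database URL.
    cred = credentials.Certificate("service-account.json")
    firebase_admin.initialize_app(cred, {"databaseURL": "https://my-project.firebaseio.com"})

    images = db.reference("Images")

    # Store only metadata plus a pointer to the bulky static asset (S3/CDN), not the bytes.
    images.child("IID_0001").set({
        "flight": "FID_0042",
        "geopoint": {"lat": 29.651, "lon": -82.324},
        "url": "https://cdn.example.com/aerials/IID_0001.jpg",
    })

    # Reading it back later:
    record = images.child("IID_0001").get()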
Whether you should use GeoHash is a huge topic and quite debatable; there are plenty of alternatives as you've already pointed out. I don't think anyone can tell you the answer to that huge implementation decision in a Q&A format (maybe as a chapter in a textbook or a discussion thread on an appropriate mailing list).
Coming from an SQL/NoSQL background, I am finding it quite challenging to model (efficiently, that is) even the simplest of exercises on a graph DB.
While different technologies have their own limitations and best practices, I am uncertain whether the mindset I am using while creating the models is the correct one; hence, I am in need of guidance, advice, and/or resources to help me get closer to the right practices.
The initial exercise I have tried is representing a file share's entire directory structure (subfolders and files) in a graph DB. For instance, some of the attributes and queries I would like to support are:
The hierarchical structure of the folders
The aggregate size at the current node
Being able to search based on who created a file/folder
Being able to search on file types
This brings me to the following questions:
Which attributes should be used for edges? Only those on which I intend to search? Only relationship information?
What if I wish to extend my graph's capabilities later, for instance, to search for files bigger than X? How does one maximize the future capabilities/flexibility of the model so that such changes do not cause massive impact?
Currently I am exploring InfiniteGraph and TitanDB.
1) The only attribute I can think of to describe an edge in a folder hierarchy is whether it is a contains or contained-by relationship.
(You don't even need that if you decide to consider all your edges one or the other. In your case, it looks like you'll almost always be interrogating descendants to search and to return aggregate size).
This is a lot simpler than a network, or a hierarchy where the edges may be of different types. Think of an organization chart that tracks not only who manages whom, but who supports whom, mentors whom, harasses whom, whatever.
2) I'm not familiar with the two databases you mentioned, but Neo4J allows indexes on node properties, so adding an index on file_size should not have much impact. It's also "schema-less," so that you can add attributes on the fly and various nodes may contain different attributes.
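I can't speak to InfiniteGraph or Titan syntax, but as a sketch of the Neo4j point with its Python driver (labels, properties, and credentials are hypothetical, and the index syntax assumes a recent Neo4j version):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Property index so size filters stay cheap as the share grows.
        session.run("CREATE INDEX file_size IF NOT EXISTS FOR (f:File) ON (f.size)")

        # Folders contain folders/files via a single CONTAINS relationship type.
        big_files = session.run(
            """
            MATCH (root:Folder {path: $path})-[:CONTAINS*]->(f:File)
            WHERE f.size > $min_size AND f.created_by = $user
            RETURN f.name AS name, f.size AS size
            """,
            path="/shares/finance", min_size=100 * 1024 * 1024, user="alice",
        )
        for record in big_files:
            print(record["name"], record["size"])

    driver.close()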
I am working on a freelance project that captures an audio file, runs some Fourier analysis, and spits out three charts (x-y plots). Each chart has about 3,000 data points, which I plan to display with Highcharts in the browser.
What database techniques do you recommend for storing and accessing this much data? Should I be storing the points in an array or in multiple rows? I'm considering Mongo too. Plan is to use Rails, so I was hoping to use a single database for both data and authentication.
I haven't dealt with queries accessing this much data for a single page, and this may very well be a tiny overall amount of data. In addition this is an MVP for demonstration to investors, so making it scalable to huge levels isn't of immediate concern.
My initial thought is that using Postgres and having one large table of data points, stored per-row, will be fine, and that a bunch of doubles is not going to be too memory-intensive relative to images and such.
Realistically, I may just pull 100 evenly-spaced data points to make the chart, but the original data must still be stored.
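A minimal sketch of that per-row Postgres idea with psycopg2 (table and column names are placeholders): keep every original point, but pull only every 30th row (~100 of ~3,000) for the chart:

    import psycopg2

    conn = psycopg2.connect("dbname=charts_dev")  # assumed local dev database
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS data_points (
            chart_id integer NOT NULL,
            idx      integer NOT NULL,   -- position within the chart's series
            x        double precision NOT NULL,
            y        double precision NOT NULL,
            PRIMARY KEY (chart_id, idx)
        )
    """)
    conn.commit()

    # Downsample: every 30th point of ~3,000 gives ~100 evenly spaced points.
    cur.execute(
        "SELECT x, y FROM data_points WHERE chart_id = %s AND idx %% 30 = 0 ORDER BY idx",
        (42,),
    )
    points = cur.fetchall()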
I've done a lot of Mongo work and I can tell you what I would do if I were you.
One of the very nice properties of your data is that the x,y coordinates are generally of a fixed size. In other words, it's not like you are storing comments from users, which can vary greatly in size.
With Mongo I would first make a sample document with the 3,000 points, just a simple array of x,y points. I would see how big that document is and how my front end handled it; in other words, can Highcharts handle that?
I would also try to stick to the easiest conceptual model to manage, which is one document per chart, each chart having 3k points. This is a natural way to think of the data and I would start there and see if there were any performance hits. Mongo can easily store those documents, so I think the biggest pain would be in the UI with rendering the data.
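A quick sketch of that experiment with PyMongo (database, collection, and field names are illustrative): build one chart document with 3,000 points, check its BSON size, then insert it:

    import bson
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    charts = client.audio_mvp.charts  # hypothetical db/collection names

    # One document per chart, with a simple array of fixed-size x,y points.
    doc = {
        "recording_id": "demo-001",
        "chart": "spectrum",
        "points": [{"x": i * 0.01, "y": 0.0} for i in range(3000)],
    }

    print("BSON size (bytes):", len(bson.encode(doc)))  # well under the 16MB document limit
    charts.insert_one(doc)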
Mongo would handle authentication well. I think it's a good choice for general data storage for an MVP.