How to cluster images from a large dataset into groups

I want to cluster an image dataset into several groups using K-means, N-cut, or another algorithm, but I don't know how to process the images in the dataset first. Each group should have its own distinguishing features. Does anyone have any suggestions?

My suggestion is that you go ahead and try a number of features.
Which feature works best for you depends very much on your use case.
Grouping photos by mood, grouping faces by person, and grouping CAD drawings by the type of gear they show require completely different feature extraction approaches. So you will have to kiss a few frogs to find your prince.
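As a starting point, here is a minimal sketch (assuming a folder of RGB images and that opencv-python and scikit-learn are installed; the folder path and number of clusters are just placeholders) that extracts simple color-histogram features and clusters them with K-means. Swap the feature extractor for whatever matches your use case, e.g. CNN embeddings for mood, face embeddings for people, or shape descriptors for CAD drawings.

```python
# Minimal sketch: cluster images by color-histogram features with K-means.
# Assumes a local folder of images; paths and cluster count are placeholders.
import glob
import cv2
import numpy as np
from sklearn.cluster import KMeans

def color_histogram(path, bins=8):
    """Return a flattened, normalized 3D color histogram as the feature vector."""
    image = cv2.imread(path)                          # image as a BGR NumPy array
    hist = cv2.calcHist([image], [0, 1, 2], None,
                        [bins, bins, bins], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

paths = sorted(glob.glob("images/*.jpg"))             # hypothetical dataset location
features = np.array([color_histogram(p) for p in paths])

kmeans = KMeans(n_clusters=5, random_state=0).fit(features)
for path, label in zip(paths, kmeans.labels_):
    print(label, path)
```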

You could use a mosaic dataset.
A mosaic dataset allows you to store, manage, view, and query small to vast collections of raster and image data. It is a data model within the geodatabase used to manage a collection of raster datasets (images) stored as a catalog and viewed as a mosaicked image. Mosaic datasets have advanced raster querying capabilities and processing functions and can also be used as a source for serving image services. You can find more here.
I also refer you to this paper:
Title: "Creating an Image Dataset to Meet Your Classification Needs: A Proof-of-Concept Study"
By:
James D. Hurd, Research Associate
Daniel L. Civco, Director and Professor


What is the best way to represent data lineage in an image processing pipeline?

I am trying to determine the best way to represent data lineage for image processing. I have images stored in S3 and I want to process them and then place them back in S3. I would then want to be able to run a query so I can see all the images and processes before and after in a chain. For example:
Image1 -ProcessA-> Image2 -ProcessB-> Image3
I would expect a search for the "lineage" of Image2 to yield the above information.
I know this looks like a cookie-cutter case for a graph database, but I am not super familiar with them, especially for a production workflow. I have been fighting with how to implement this model in a relational database, but feel like I am just trying to put a square peg in a round hole.
Is a graph DB the only option? Which flavor would you suggest?
Is there a way to make this work in a relational model that I have not considered?
You are correct when you say this is a cookie-cutter case for a graph database, and any of the available graph database products will likely be able to meet your requirements. You can also solve this problem using a relational database but, as you indicated, it would be like putting a square peg in a round hole.
Disclosure: I work for Objectivity, maker of the InfiniteGraph product.
I have solved similar data lineage problems using InfiniteGraph. The basic idea is to separate your data from your metadata. The "lineage" information is metadata. Let's put that in the graph database. The lineage information will include objects (nodes) that contain the metadata for images, and the workflow process steps that consume images as input and generate images or other information as output.
We might define an ImageMD type in InfiniteGraph to contain the metadata for an image, including a URI that defines where the image data is currently stored, and the size and format of the image. We might define a ProcessMD type to describe an application that operates on images. Its attributes might include the name and version of the application as well as its deployment timestamp and the host where it is running.
You are going to end up with a graph in which image metadata nodes are connected to the process metadata nodes that consume and produce them.
Then, given an image, you can track its lineage backward to see its history and forward to see how it or its derivative components evolved or were used.
This is the basis for the Objectivity, Inc. application Metadata Connect.
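The same idea can be prototyped without InfiniteGraph. Here is a minimal sketch using networkx instead (node names, URIs, and attributes are all hypothetical): image-metadata and process-metadata nodes form a directed graph, and the lineage query is just an ancestor/descendant traversal.

```python
# Sketch of the metadata-as-a-graph idea, using networkx rather than InfiniteGraph.
import networkx as nx

g = nx.DiGraph()

# Image metadata nodes: the URI points at the real data (e.g. an S3 object).
g.add_node("Image1", kind="ImageMD", uri="s3://bucket/image1.png")
g.add_node("Image2", kind="ImageMD", uri="s3://bucket/image2.png")
g.add_node("Image3", kind="ImageMD", uri="s3://bucket/image3.png")

# Process metadata nodes: which application ran, and where.
g.add_node("ProcessA", kind="ProcessMD", version="1.2", host="worker-01")
g.add_node("ProcessB", kind="ProcessMD", version="0.9", host="worker-02")

# Edges express consumed-by / produced relationships.
g.add_edge("Image1", "ProcessA")   # ProcessA consumed Image1 ...
g.add_edge("ProcessA", "Image2")   # ... and produced Image2
g.add_edge("Image2", "ProcessB")
g.add_edge("ProcessB", "Image3")

# Lineage of Image2: everything upstream (history) and downstream (derivatives).
print("upstream:", nx.ancestors(g, "Image2"))      # {'Image1', 'ProcessA'}
print("downstream:", nx.descendants(g, "Image2"))  # {'ProcessB', 'Image3'}
```

A dedicated graph database gives you the same traversal semantics at scale, but this is enough to validate the data model before committing to a product.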

Multiple Product Images Database Design

I'm trying to figure out the best way to add multiple product images for a given product. The number of images will vary depending on the product.
I was thinking about using a pivot table. Do you have any other ideas? I'm also open to critiques of my overall design.
My product db draft
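For what it's worth, a plain one-to-many child table is usually enough here; a pivot (junction) table is only needed if the same image can be shared by several products. Below is a minimal sketch in Python/SQLite with hypothetical table and column names.

```python
# Sketch: one product has many images, so a child table with a foreign key suffices.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    price REAL
);
CREATE TABLE product_image (
    id         INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES product(id),
    url        TEXT NOT NULL,      -- path or CDN/S3 URL of the image file
    sort_order INTEGER DEFAULT 0   -- which image to show first
);
""")

conn.execute("INSERT INTO product (id, name, price) VALUES (1, 'Widget', 9.99)")
conn.executemany(
    "INSERT INTO product_image (product_id, url, sort_order) VALUES (?, ?, ?)",
    [(1, "img/widget-front.jpg", 0), (1, "img/widget-back.jpg", 1)],
)

for (url,) in conn.execute(
    "SELECT url FROM product_image WHERE product_id = 1 ORDER BY sort_order"
):
    print(url)
```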

Neo4J: Binary File storage and Text Search "stack"

I have a project I would like to work on which I feel is a beautiful case for Neo4j. But there are aspects about implementing this that I do not understand enough to succinctly list my questions. So instead, I'll let the scenario speak for itself:
Scenario: Put simply, I want to build an application for users who receive files of various types (docs, Excel and Word files, images, audio clips and even the occasional video) and allow them to upload and categorize these files.
With each file they will enter any and all associations. Examples:
If Joe authors a PDF, Joe is associated with the PDF.
If a DOC says that Sally is Mary's mother, Sally is associated with Mary.
If Bill sent an email to Jane, Bill is associated with Jane (and the email).
If company X sends an invoice (Excel grid) to company Y, X is associated with Y.
and so on...
So the basic goal at this point would be to:
Have users load in files as they receive them.
Enter the associations that each file contains.
Review associations holistically, in order to predict or take some action.
Generate a report of the interested associations including the files that the associations are based on.
The value of this project is in the associations, which in reality would grow much more complex than the above examples and should produce interesting conclusions. However, if the user is asked "How did you come to that conclusion?", they need to be able to produce a summary of the associations as well as any files that these associations are based on, i.e. the PDF or Excel file or whatever.
Initial thoughts...
I should also add that this application would be hosted internally and probably used by approximately 50 users, so I probably don't need the fastest, most scalable, highest-availability solution possible. The data being loaded could get rather large though, maybe up to a terabyte in a year (not the associations but the actual files).
Wouldn't it be great if Neo4J just did all of this! Obviously it should handle the graph aspects of this very nicely, but I figure that the file storage and text search are going to need another player added to the mix.
Some combinations of solutions I know of would be:
Store EVERYTHING, including files as binaries, in Neo4J.
Would be wrestling Neo4J into something it's not built for.
How would I search text?
Store only associations and metadata in Neo4J and uploaded files on the file system.
How would I do text searches on files that are stored on a file server?
Store only associations and metadata in Neo4J and uploaded files in Postgres.
Not so confident about having all my files inside a DB. I feel more comfortable having all my files accessible in folders.
Everyone says it's great to put your files in a DB. Everyone says it's not great to put your files in a DB.
Get to the bloody questions..
Can anyone suggest a good "stack" that would suit the above?
Please give a basic outline on how you would implement your suggestion, ie:
Have the application store the data into Neo4J, then use triggers to update Postgres.
Or have the files loaded into Postgres and triggers update Neo4J.
Or have the application load data into Neo4J and then have the application load data into Postgres.
etc
How you would tie these together is probably what I am really trying to grasp.
Thank you very much for any input on this.
Cheers.
p.s. What a ramble! If you feel the need to edit my question or title to simplify, go for it! :)
Here are my recommendations:
Never store binary files in the database. Store them in a filesystem or a service like AWS S3 instead and reference the file in your data model.
I would store the file first in S3 and then a reference to it in your primary database (Neo4j?).
If you want to be able to search for any word in a document I would recommend using a full-text search engine like Elasticsearch. Elasticsearch can extract text from multiple document formats like PDF using Tika.
You can probably also use Elastic/Tika to search for relationships in the document and surface them in order to update your graph.
Suggested Stack:
Neo4j
Elasticsearch
AWS S3 or some other redundant filesystem to avoid data loss
Bonus: See this SO question/answer for best practices on indexing files in multiple formats using ES.
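To make the flow concrete, here is a minimal sketch of how the application can act as the glue between the three pieces (bucket, index, node labels, credentials and the ingest function are all hypothetical; the text extraction step, e.g. via Tika, is assumed to happen elsewhere): upload the binary to S3, index the extracted text in Elasticsearch, and store only metadata and associations in Neo4j.

```python
# Sketch of the suggested stack: S3 for binaries, Elasticsearch for full text,
# Neo4j for associations/metadata. Names and credentials are placeholders.
import boto3
from elasticsearch import Elasticsearch   # elasticsearch-py 8.x client
from neo4j import GraphDatabase

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")
neo4j = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def ingest(local_path, file_id, extracted_text, author):
    # 1. The binary goes to S3; only the key is kept in the databases.
    key = f"files/{file_id}"
    s3.upload_file(local_path, "my-file-bucket", key)

    # 2. Extracted text goes to Elasticsearch so any word can be searched.
    es.index(index="documents", id=file_id,
             document={"s3_key": key, "text": extracted_text})

    # 3. Metadata and associations go to Neo4j.
    with neo4j.session() as session:
        session.run(
            "MERGE (p:Person {name: $author}) "
            "MERGE (f:File {id: $file_id, s3_key: $key}) "
            "MERGE (p)-[:AUTHORED]->(f)",
            author=author, file_id=file_id, key=key,
        )

ingest("report.pdf", "doc-001", "text extracted from the PDF...", "Joe")
```

Tying the pieces together is then just application-level ordering (upload first, index and link afterwards), which keeps Neo4j and Elasticsearch as pure metadata/search layers over the files sitting in S3.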

Large classification document corpus

Can anyone point me to a large corpus that I can use for classification?
By large I don't mean Reuters or 20 Newsgroups; I'm talking about a corpus of GB size, not 20 MB or something like that.
I was only able to find Reuters and 20 Newsgroups, which are far too small for what I need.
The most popular datasets for text-classification evaluation are:
Reuters Dataset
20 Newsgroup Dataset
However, the datasets above do not meet the 'large' requirement. The datasets below might meet your criteria:
Commoncrawl. You could build a large corpus by extracting articles that have specific keywords in their meta tags and apply it to document classification (see the sketch after this list).
Enron Email Dataset. You could do a variety of different classification tasks here.
Topic Annotated Enron Dataset. Not free, but already labelled and meets your large-corpus requirement.
You can browse other publicly available datasets here
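For the Common Crawl option, here is a minimal sketch (assuming you have downloaded one WARC segment locally and installed warcio; the file name and keyword filter are just illustrations) of pulling out pages whose HTML mentions a given topic, so you can accumulate labelled classes of your own.

```python
# Sketch: build a topic-specific corpus from one downloaded Common Crawl WARC file.
# Assumes `pip install warcio` and a local segment such as crawl-segment.warc.gz.
from warcio.archiveiterator import ArchiveIterator

KEYWORD = b"finance"   # illustrative topic keyword to filter on
corpus = []

with open("crawl-segment.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        html = record.content_stream().read()
        if KEYWORD in html.lower():
            url = record.rec_headers.get_header("WARC-Target-URI")
            corpus.append((url, html))

print(f"collected {len(corpus)} pages mentioning {KEYWORD.decode()}")
```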
Other than the above, you might have to develop your own corpus. I will be releasing a news corpus builder later this weekend that will help you develop custom corpora based on topics of your choice.
Update:
I created the custom corpus builder module I mentioned above but forgot to link it: News Corpus Builder
Huge Reddit archive spanning 10/2007 to 5/2015

Google Maps Fusion Tables feasibility

I am wondering if someone can provide some insight about an approach for Google Maps. Currently I am developing a visualization with the Google Maps API v3. This visualization will map out polygons for country, state, zip code, city, etc., as well as map 3 other marker types (balloon, circle, ...). This data is dynamically driven by an underlying report which can have filters applied and can be drilled to many levels. The biggest problem I am running into is dynamically rendering the polygons. The data necessary to generate a polygon with Google Maps v3 is large, and it also requires a good deal of processing at runtime.
My thought is that, since my visualization will never allow the user to return very large data sets (e.g. all zip codes for the USA), I could employ dynamically created Fusion Tables.
Let's say for each run my report will return 50 states or 50 zip codes. Users can drill from state > zip.
On the first run of the visualization, users will run a report and it will return the state name and 4 metrics. Would it be possible to dynamically create a Fusion Table based on this information? Would I be able to pass through 4 metrics and formatting for all of the different markers to be drawn on the map?
On the second run the user will drill from state to zip code. The report will then return 50 zip codes and 4 metrics. Could the initial table be dropped and another table created to draw a map with the same requirements as above, providing the Fusion Table with zip codes (22054, 55678, ...) and 4 metric values and formatting?
Sorry for being long winded. Even after reading the fusion table documentation I am not 100% certain on this.
Fully-hosted solution
If you can upload the full dataset and get Google to do the drill-down, you could check out the Google Maps Engine platform. It's built to handle big sets of geospatial data, so you don't have to do the heavy lifting.
Product page is here: http://www.google.com/intl/en/enterprise/mapsearth/products/mapsengine.html
API doco here: https://developers.google.com/maps-engine/
Details on hooking your data up with the normal Maps API here: https://developers.google.com/maps/documentation/javascript/mapsenginelayers
Dynamic hosted solution
However, since you want to do this dynamically it's a little trickier. Neither the Fusion Tables API nor the Maps Engine API at this point in time support table creation via their APIs, so your best option is to model your data in a consistent schema so you can create your table (in either platform) ahead of time and use the API to upload & delete data on demand.
For example, you could create a table in MapsEngine ahead of time for each drill-down level (e.g. one for state, one for zip-code) & use the batchInsert method to add data at run-time.
If you prefer Fusion Tables, you can use insert or importRows.
Client-side solution
The above solutions are fairly complex & you may be better off generating your shapes using the Maps v3 API drawing features (e.g. simple polygons).
If your data mapping is quite complex, you may find it easier to bind your data to a Google Map using D3.js. There's a good example here. Unfortunately, this does mean investigating yet another API.
