Detect multi-dimensional array storage contention - arrays

I'm looking for a technique or algorithm that can help with a design idea. Keep in mind that my proposed solution is open for modification, as well.
I have a series of blocks (square and rectangular) that fit into a grid. The grid and the pieces are unique in that some blocks can fit into multiple locations and others can fit only into a limited number of locations. If it helps, think of something like "Battleship" where the pieces have unique connectors which limits their placement.
I imagine this as a multi-dimensional array. Using a sort of collision or storage contention technique I would like to devise the series of solutions where all pieces are able to fit. An optimal solution would be the one which allows the most pieces to fit on the board at any one time.
I have considered Interval Scheduling, a variety of 2D collision detection algorithms, and looked into graph theory (e.g. Flow Network). These all seem to be overkill for the design.
I don't have specific terminology for what I am looking for so searching for a solution is difficult. If I have to brute force this then, OK, but I have to believe there is a more elegant solution.

Related

Does a Decision Network / Decision Forest take into account relationships between inputs

I have experience dealing with Neural Networks, specifically ones of the Back-Propagating nature, and I know that of the inputs passed to the trainer, dependencies between inputs are part of the resulting models knowledge when a hidden layer is introduced.
Is the same true for decision networks?
I have found that information around these algorithms (ID3) etc somewhat hard to find. I have been able to find the actual algorithms, but information such as expected/optimal dataset formats and other overviews are rare.
Thanks.
Decision Trees are actually very easy to provide data to because all they need is a table of data, and which column out of that data what feature (or column) you want to predict on. That data can be discrete or continuous for any feature. Now there are several flavors of decision trees with different support for continuous and discrete values. And they work differently so understanding how each one works can be challenging.
Different decision tree algorithms with comparison of complexity or performance
Depending on the type of algorithm you are interested in it can be hard to find information without reading the actual papers if you want to try and implement it. I've implemented the CART algorithm, and the only option for that was to find the original 200 page book about it. Most of other treatments only discuss ideas like splitting with enough detail, but fail to discuss any other aspect at more than a high level.
As for if they take into account the dependencies between things. I believe it only assumes dependence between each input feature and the prediction feature. If the input was independent from the prediction feature you couldn't use it as a split criteria. But, between other input features I believe they must be independent of each other. I'd have to check the book to ensure that was true or not, but off the top of my head I think that's true.

Algorithm to find related words in a text

I would like to have a word (e.g. "Apple) and process a text (or maybe more). I'd like to come up with related terms. For example: process a document for Apple and find that iPod, iPhone, Mac are terms related to "Apple".
Any idea on how to solve this?
As a starting point: your question relates to text mining.
There are two ways: a statistical approach, and one form natural language processing (nlp).
I do not know much about nlp, but can say something about the statistical approach:
You need some vector space representation of your documents, see
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Document-term_matrix
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
In order to learn semantics, that is: different words mean the same, or one word can have different meanings, you need a large text corpus for learning. As I said this is a statistical approach, so you need lots of samples.
http://www.daviddlewis.com/resources/testcollections/
Maybe you have lots of documents from the context you are going to use. That is the best situation.
You have to retrieve latent factors from this corpus. Most common are:
LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
nonnegative matrix factorization (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
latent dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
These methods involve lots of math. Either you dig it, or you have to find good libraries.
I can recommend the following books:
http://www.oreilly.de/catalog/9780596529321/toc.html
http://www.oreilly.de/catalog/9780596516499/index.html
Like all of AI, it's a very difficult problem. You should look into natural language processing to learn about some of the issues.
One very, very simplistic approach can be to build a 2d-table of words, with for each pair of words the average distance (in words) that they appear in the text. Obviously you'll need to limit the maximum distance considered, and possibly the number of words as well. Then, after processing a lot of text you'll have an indicator of how often certain words appear in the same context.
What I would do is get all the words in a text and make a frequency list (how often each word appears). Maybe also add to it a heuristic factor on how far the word is from "Apple". Then read multiple documents, and cross out words that are not common in all the documents. Then prioritize based on the frequency and distance from the keyword. Of course, you will get a lot of garbage and possibly miss some relevant words, but by adjusting the heuristics you should get at least some decent matches.
The technique that you are looking for is called Latent Semantic Analysis (LSA). It is also sometimes called Latent Semantic Indexing. The technique operates on the idea that related concepts occur together in text. It uses statistics to build the word relationships. Given a large enough corpus of documents it will definitely solve your problem of finding related words.
Take a look at vector space models.

Datastore for large astrophysics simulation data

I'm a grad student in astrophysics. I run big simulations using codes mostly developed by others over a decade or so. For examples of these codes, you can check out gadget http://www.mpa-garching.mpg.de/gadget/ and enzo http://code.google.com/p/enzo/. Those are definitely the two most mature codes (they use different methods).
The outputs from these simulations are huge. Depending on your code, your data is a bit different, but it's always big data. You usually take billions of particles and cells to do anything realistic. The biggest runs are terabytes per snapshot and hundreds of snapshots per simulation.
Currently, it seems that the best way to read and write this kind of data is to use HDF5 http://www.hdfgroup.org/HDF5/, which is basically an organized way of using binary files. It's a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this.
I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?
If it helps, we typically store data columnwise. That is, you have a block of all particle id's, block of all particle positions, block of particle velocites, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume.
edit: Sorry for being vague about the issues. Steve is right that this might just be an issue of data structure rather than the data storage method. I have to run now, but I will provide more details late tonight or tomorrow.
edit 2: So the more I look into this, the more I realize that this probably isn't a datastore issue anymore. The main issue with unformatted binary was all the headaches reading the data correctly (getting the block sizes and order right and being sure about it). HDF5 pretty much fixed that and there isn't going to be a faster option until the file system limitations are improved (thanks Matt Turk).
The new issues probably come down to data structure. HDF5 is as performant as we can get, even if it is not the nicest interface to query against. Being used to databases, I thought it would be really interesting/powerful to be able to query something like "give me all particles with velocity over x at any time". You can do something like that now, but you have to work at a lower level. Of course, given how big the data is and depending on what you are doing with it, it might be a good thing to work at a low level for performance sake.
MongoDB: http://www.mongodb.org/
Netezza
Products:
http://www.netezza.com/data-warehouse-appliance-products/skimmer.aspx
Hadoop: http://hadoop.apache.org/
Wikipedia's List of Distributed File
Systems:
http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems
EDIT
Rationale for my lack of explanation / etc.:
OP says: "[HDF5]'s a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this."
What does "better" mean? Better structured? He seems to allude to the "unformatted binary files" as being an issue - so maybe that's what he means by better. If so, he'll need something with some structure - hence the first couple suggestions.
OP says: "I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?"
Yes, there are several. Both structured, and "unstructured" - does he want structure, or is he happy to leave them in some sort of "unformatted binary format"? We still don't know - so I suggest checking out some Distributed File Systems.
OP says: "If it helps, we typically store data columnwise. That is, you have a block of all particle id's, block of all particle positions, block of particle velocites, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume."
Again, Does the OP want better structure, or doesn't he? Seems like he wants both - better structure AND faster.... maybe scaling OUT will give him this. This further reinforces the first few options I listed.
OP says (in comments): "I don't know if we can take the hit on io though."
Are there IO requirements? Cost restrictions? What are they?
We can't get something for nothing here. There is no "silver-bullet" storage solution. All we have to go on here for requirements is "lots of data" and "I don't know if I like the lack of structure, but I'm not willing to increase my IO to accommodate any additional structure"... so I don't know what kind of answer he's expecting. He hasn't listed a single complaint about the current solution he has other than the lack of structure - and he's already said he's not willing to pay any overhead to do anything about that... so.... ?

Feature selection and unsupervised learning for multilingual data + machine learning algorithm selection

Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatical approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose to use that data set or not -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are on the final stages of scraping data/text from websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features, namely tokenize the unicode text of the pages, and use a dictionary to translate every new token to a number, and simply consider the existence of a token as a feature.
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run SVM this is also nice.
Compare your results with some baseline - say assigning the most frequent class to all the pages.
Is the simplest approach good enough? If not, start iterating over algorithms and features.
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say lexical features (bag-o'-words style) then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs that try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more that one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
Main answer is: try different approaches. Without actual testing it's very hard to predict what method will give best results. So, I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the data classification is not very accurate, it may still give better results than unsupervised clustering. One of the reasons for it is a number of random factors that are used during clustering. For example, k-means algorithm relies on randomly selected points when starting the process, which can lead to a very different results for different program runnings (though x-means modifications seems to normalize this behavior). Clustering will give good results only if underlying elements produce well separated areas in the feature space.
One of approaches to treating multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia's articles and create "bridges" between same topics in different languages. Alternatively, you can create multilingual association dictionary like this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. It uses vector space model to calculate distance between words and/or documents. In contrast to other methods it can efficiently treat synonymy and polysemy. Disadvantage of this method is a computational inefficiency and leak of implementations. One of the phases of LSI makes use of a very big cooccurrence matrix, which for large corpus of documents will require distributed computing and other special treatment. There's modification of LSA called Random Indexing which do not construct full coocurrence matrix, but you'll hardly find appropriate implementation for it. Some time ago I created library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find project 'Clinch' of a user 'faithlessfriend' on github (I'll not post direct link to avoid unnecessary advertisement).
Beyond special semantic methods the rule "simplicity first" must be used. From this point, Naive Bayes is a right point to start from. The only note here is that multinomial version of Naive Bayes is preferable: my experience tells that count of words really does matter.
SVM is a technique for classifying linearly separable data, and text data is almost always not linearly separable (at least several common words appear in any pair of documents). It doesn't mean, that SVM cannot be used for text classification - you still should try it, but results may be much lower than for other machine learning tasks.
I haven't enough experience with decision trees, but using it for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried to use C4.5 algorithm for this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them by yourself. It is always better to know then to suggest.
There's much more to say on every topic, so feel free to ask more questions on specific topic.

Well explained algorithms for indexing and searching in metric spaces

I need to implement some kind of metric space search in Postgres(*) (PL or PL/Python). So, I'm looking for good sources (or papers) with a very clear and crisp explanation of the machinery behind these ideas, in such way that I can implement it myself.
I would prefer clarity over efficiency.
(*) The need for that is described better here.
Especially for geographical data, look at PostGIS first to see if you need to implement anything. If you do, start with the papers listed in the Wikipedia entry on GiST.
Looking at your link, it seems your metric space is strings with some sort of edit distance as the metric. A nice but oldish overview of some solutions is given by Navarro, Baeza-Yates, Sutinen, and Tarhio, IEEE Data Engineering Bulletin, 2001; the related papers on Citeseer could also be useful. Locality Sensitive Hashing is a newer technique that might be useful, but a lot of the papers are heavy on math.
BK-Trees are useful for indexing and searching anything that obeys the triangle inequality, metric spaces included. The canonical example is searching for strings within a given edit distance of a target. I wrote an article about that here.
Unfortunately, there's no built in support for this in Postgres. You could implement it yourself using GIST, but obviously that'll be a lot of work. I can't think of any way to implement it without writing your own indexes short of storing the tree in a table, which obviously isn't going to be very efficient.
You can try http://sisap.org where many modern metric indexes are listed, including BK-trees. You can find code in C to try different alternatives.
Some techniques that involve space search that might help you are Hill-Climbing, Neural Network Training, Genetic Algorithm, and Particle Swarm.
You will also need to define a distance metric over your metric space. Have you done so?(& out of curiosity, what is it, if you have done so)

Resources