Looking for a source of information on the most efficient ways to use tf.js methods - tensorflow.js

Is there any source of information on the relative efficiency of different tfjs methods? For example, after benchmarking it seems that casting inside a tf.add operation is much slower than summing the pre-existing tensors.
Any ideas?
During benchmarking I also see that operations such as argMax or max are very slow.
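For concreteness, here is a minimal sketch of the kind of comparison I mean, using tf.time() from @tensorflow/tfjs (the shapes and ops are arbitrary, and kernelMs may not be reported on every backend):

```typescript
import * as tf from '@tensorflow/tfjs';

// Times a tensor-producing function with tf.time(), warming up once so
// one-off costs (e.g. WebGL shader compilation) are not counted.
async function bench(name: string, fn: () => tf.Tensor): Promise<void> {
  tf.tidy(() => { fn().dataSync(); });                             // warm-up
  const info = await tf.time(() => tf.tidy(() => { fn().dataSync(); }));
  console.log(`${name}: kernel ${info.kernelMs} ms, wall ${info.wallMs} ms`);
}

async function main(): Promise<void> {
  const a = tf.randomUniform([1024, 1024]);
  const b = tf.randomUniform([1024, 1024]);
  const bInt = tf.randomUniform([1024, 1024], 0, 10, 'int32');

  await bench('add, matching dtypes ', () => tf.add(a, b));
  await bench('add with cast inside ', () => tf.add(a, bInt.cast('float32')));
  await bench('argMax over last axis', () => tf.argMax(a, 1));
  await bench('max over last axis   ', () => tf.max(a, 1));

  tf.dispose([a, b, bInt]);
}

main();
```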

Related

Accuracy of Document Embeddings in Apache Solr

I made use of BERT document embeddings to perform information retrieval on the CACM dataset. I achieved a very low accuracy score of around 6%. However, when I used the traditional BM-25 method, the result was a lot closer to 40%, which is close to the average accuracy found in the literature for this dataset. This is all being performed within Apache Solr.
I also attempted to perform information retrieval using Doc2Vec and achieved similarly poor results as with BERT. Is it not advisable to use document embeddings for IR tasks such as this one?
Many people find document embeddings work really well for their purposes!
If they're not working for you, possible reasons include:
insufficiency of training data
problems in your unshown process
different end-goals than others have (what's your idea of 'accuracy'?)
It's impossible to say what's affecting your process, and your raw perception of its usefulness, without far more details on what you're aiming to achieve and then what you are actually doing.
Most notably, if there's other published work that uses the same dataset and a similar definition of 'accuracy', and it claims a far better result using the same methods that give worse results for you, then it's more likely that there are errors in your implementation.
You'd have to name the results you're trying to match (ideally with links to the exact writeups), and show the details of what your code does, for others to have any chance of guessing what's happening for you.

Cache models in Promela

I am looking to model caches for multicore processors, including cache coherence. Do such PROMELA implementations already exist? I tried to search for them, but couldn't find any. Secondly, if I have to implement it myself, is it feasible in PROMELA to declare very large arrays, e.g. to represent cache structures?
I personally don't know of such existing Promela models. Moreover, large array structures sound like a serious state-space blow-up.
Depending on what properties you want to show, I would suggest abstracting from reality as much as possible. Modeling things with high fidelity to the real world is typically not something one should do in Promela.
Two alternative suggestions:
Model your cache in Java and prove first-order assertions with the KeY proof system
Model your cache in a mathematical fashion using the Coq proof assistant and prove the desired theorems with Coq
[This is the type of question that would be closed… but there aren't many people answering Promela/SPIN questions so you won't get 5 close votes.]
A Google search for 'formal verification cache coherence spin' turns up SPIN being used a couple of times.
There is a yearly SPIN Workshop; full papers are listed for the last 14 years.

Clustering tens of millions of high-dimensional data points

I have a set of 50 million text snippets and I would like to create some clusters out of them. The dimensionality might be somewhere between 60k-100k. The average text snippet length would be 16 words. As you can imagine, the frequency matrix would be pretty sparse. I am looking for a software package / library / SDK that would allow me to find those clusters. I have tried CLUTO in the past, but this seems like a very heavy task for CLUTO. From my research online I found that BIRCH is an algorithm that can handle such problems, but, unfortunately, I couldn't find any BIRCH implementation software online (I only found a couple of ad-hoc implementations, like assignment projects, that lacked any sort of documentation whatsoever). Any suggestions?
You may be interested in checking out the Streaming EM-tree algorithm, which uses the TopSig representation. Both of these are from my Ph.D. thesis on the topic of large-scale document clustering.
We recently clustered 733 million documents on a single 16-core machine (http://ktree.sf.net). It took about 2.5 days to index the documents and 15 hours to cluster them.
The Streaming EM-tree algorithm can be found at https://github.com/cmdevries/LMW-tree. It works with binary document vectors produced by TopSig which can be found at http://topsig.googlecode.com.
I wrote a blog post about a similar approach earlier at http://chris.de-vries.id.au/2013/07/large-scale-document-clustering.html. However, the EM-tree scales better for parallel execution and also produces better quality clusters.
If you have any questions please feel free to contact me at chris#de-vries.id.au.
My professor made this implementation of the BIRCH algorithm in Java. It is easy to read, with some inline comments.
Try a graph partitioning algorithm. It may help make clustering on high-dimensional data possible.
I suppose you're rather looking for something like all-pairs similarity search.
This will give you pairs of similar records up to a desired threshold. You can use bits of graph theory to extract clusters afterwards: consider each pair an edge. Extracting connected components will then give you something like single-linkage clustering, while cliques will give you complete-linkage clusters.
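A minimal sketch of that idea, using brute-force cosine similarity and union-find for the connected components (a real all-pairs search would prune candidate pairs, e.g. with an inverted index, rather than comparing everything to everything):

```typescript
// Pairs above a similarity threshold become edges; connected components
// of the resulting graph are the (single-linkage style) clusters.

type Vec = Map<string, number>;

function toVector(text: string): Vec {
  const v: Vec = new Map();
  for (const tok of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    v.set(tok, (v.get(tok) ?? 0) + 1);
  }
  return v;
}

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { na += w * w; dot += w * (b.get(t) ?? 0); }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Union-find with path halving.
function find(parent: number[], i: number): number {
  while (parent[i] !== i) { parent[i] = parent[parent[i]]; i = parent[i]; }
  return i;
}

function cluster(snippets: string[], threshold = 0.5): number[][] {
  const vecs = snippets.map(toVector);
  const parent = snippets.map((_, i) => i);
  for (let i = 0; i < vecs.length; i++) {
    for (let j = i + 1; j < vecs.length; j++) {
      if (cosine(vecs[i], vecs[j]) >= threshold) {
        parent[find(parent, i)] = find(parent, j);   // merge the two components
      }
    }
  }
  const groups = new Map<number, number[]>();
  snippets.forEach((_, i) => {
    const root = find(parent, i);
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root)!.push(i);
  });
  return [...groups.values()];
}

console.log(cluster([
  'birch clustering of sparse text vectors',
  'clustering sparse text vectors with birch',
  'postgres metric space search',
]));   // -> [ [ 0, 1 ], [ 2 ] ]
```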
I just found an implementation of BIRCH in C++.

Feature selection and unsupervised learning for multilingual data + machine learning algorithm selection

Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose to use that data set or not -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are in the final stages of scraping data/text from websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features, namely tokenizing the Unicode text of the pages, using a dictionary to translate every new token to a number, and simply considering the existence of a token as a feature.
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run an SVM, that is also nice.
Compare your results with some baseline - say, assigning the most frequent class to all the pages.
Is the simplest approach good enough? If not, start iterating over algorithms and features.
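A minimal sketch of that recipe, with made-up documents and labels and a hand-rolled Naive Bayes (rather than any particular library), including the majority-class baseline:

```typescript
// Token-presence features + multinomial Naive Bayes with add-one smoothing,
// compared against a "most frequent class" baseline.

type Doc = { tokens: Set<string>; label: string };

function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

class NaiveBayes {
  private classCounts = new Map<string, number>();
  private tokenCounts = new Map<string, Map<string, number>>(); // label -> token -> count
  private vocab = new Set<string>();
  private total = 0;

  train(docs: Doc[]): void {
    for (const { tokens, label } of docs) {
      this.total++;
      this.classCounts.set(label, (this.classCounts.get(label) ?? 0) + 1);
      const counts = this.tokenCounts.get(label) ?? new Map<string, number>();
      this.tokenCounts.set(label, counts);
      for (const t of tokens) {
        this.vocab.add(t);
        counts.set(t, (counts.get(t) ?? 0) + 1);
      }
    }
  }

  predict(tokens: Set<string>): string {
    let best = '';
    let bestScore = -Infinity;
    for (const [label, n] of this.classCounts) {
      const counts = this.tokenCounts.get(label)!;
      let totalTokens = 0;
      for (const c of counts.values()) totalTokens += c;
      // log P(class) + sum of log P(token | class) with Laplace smoothing
      let score = Math.log(n / this.total);
      for (const t of tokens) {
        score += Math.log(((counts.get(t) ?? 0) + 1) / (totalTokens + this.vocab.size));
      }
      if (score > bestScore) { bestScore = score; best = label; }
    }
    return best;
  }
}

// Baseline: always predict the most frequent training label.
function majorityLabel(docs: Doc[]): string {
  const freq = new Map<string, number>();
  for (const d of docs) freq.set(d.label, (freq.get(d.label) ?? 0) + 1);
  return [...freq.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

const trainingDocs: Doc[] = [
  { tokens: tokenize('achetez nos chaussures en ligne'), label: 'shop' },
  { tokens: tokenize('buy cheap shoes online today'), label: 'shop' },
  { tokens: tokenize('latest football scores and news'), label: 'news' },
];
const nb = new NaiveBayes();
nb.train(trainingDocs);
console.log(nb.predict(tokenize('cheap shoes online')));   // -> 'shop'
console.log('baseline:', majorityLabel(trainingDocs));     // -> 'shop'
```

Note that tokens from different languages simply become different dictionary entries, so the multilingual aspect needs no special handling here.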
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say, lexical features (bag-of-words style), then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs should you try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
The main answer is: try different approaches. Without actual testing it's very hard to predict which method will give the best results. So I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the data classification is not very accurate, it may still give better results than unsupervised clustering. One of the reasons is the number of random factors used during clustering. For example, the k-means algorithm relies on randomly selected starting points, which can lead to very different results across runs (though the x-means modification seems to normalize this behavior). Clustering will give good results only if the underlying elements form well-separated regions in the feature space.
One approach to treating multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can create a multilingual association dictionary as this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. It uses a vector space model to calculate distances between words and/or documents. In contrast to other methods, it can efficiently handle synonymy and polysemy. The disadvantages of this method are computational inefficiency and a lack of implementations. One of the phases of LSI makes use of a very big co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There's a modification of LSA called Random Indexing which does not construct the full co-occurrence matrix, but you'll hardly find an appropriate implementation for it. Some time ago I created a library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' by the user 'faithlessfriend' on GitHub (I'll not post a direct link to avoid unnecessary advertisement).
Beyond special semantic methods, the rule "simplicity first" applies. From this point of view, Naive Bayes is the right place to start. The only note here is that the multinomial version of Naive Bayes is preferable: my experience is that word counts really do matter.
SVM is a technique for classifying linearly separable data, and text data is almost never linearly separable (at least several common words appear in any pair of documents). This doesn't mean that SVMs cannot be used for text classification - you should still try them, but the results may be much lower than for other machine learning tasks.
I don't have enough experience with decision trees, but using them for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried the C4.5 algorithm for this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them for yourself. It is always better to know than to guess.
There's much more to say on every topic, so feel free to ask more questions on a specific topic.

Well explained algorithms for indexing and searching in metric spaces

I need to implement some kind of metric space search in Postgres(*) (PL or PL/Python). So, I'm looking for good sources (or papers) with a very clear and crisp explanation of the machinery behind these ideas, in such a way that I can implement it myself.
I would prefer clarity over efficiency.
(*) The need for that is described better here.
Especially for geographical data, look at PostGIS first to see if you need to implement anything. If you do, start with the papers listed in the Wikipedia entry on GiST.
Looking at your link, it seems your metric space is strings with some sort of edit distance as the metric. A nice but oldish overview of some solutions is given by Navarro, Baeza-Yates, Sutinen, and Tarhio, IEEE Data Engineering Bulletin, 2001; the related papers on Citeseer could also be useful. Locality Sensitive Hashing is a newer technique that might be useful, but a lot of the papers are heavy on math.
BK-Trees are useful for indexing and searching anything that obeys the triangle inequality, metric spaces included. The canonical example is searching for strings within a given edit distance of a target. I wrote an article about that here.
Unfortunately, there's no built-in support for this in Postgres. You could implement it yourself using GiST, but obviously that'll be a lot of work. I can't think of any way to implement it without writing your own indexes, short of storing the tree in a table, which obviously isn't going to be very efficient.
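For illustration, a minimal in-memory BK-tree sketch over Levenshtein distance (a table-backed Postgres version would persist the nodes and edges instead of keeping them in memory):

```typescript
// BK-tree: each child edge is labelled with the distance to its parent word.
// The triangle inequality lets search prune any subtree whose edge label
// lies outside [d - maxDist, d + maxDist].

function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0];           // diagonal value (old dp[i-1] from column j-1)
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const cur = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,                                   // skip b[j-1]
        dp[i - 1] + 1,                               // skip a[i-1]
        prev + (a[i - 1] === b[j - 1] ? 0 : 1),      // substitute / match
      );
      prev = cur;
    }
  }
  return dp[a.length];
}

interface BKNode { word: string; children: Map<number, BKNode>; }

class BKTree {
  private root?: BKNode;

  add(word: string): void {
    if (!this.root) { this.root = { word, children: new Map() }; return; }
    let node = this.root;
    for (;;) {
      const d = levenshtein(word, node.word);
      if (d === 0) return;                            // already present
      const child = node.children.get(d);
      if (!child) { node.children.set(d, { word, children: new Map() }); return; }
      node = child;
    }
  }

  search(query: string, maxDist: number): string[] {
    const results: string[] = [];
    const stack: BKNode[] = this.root ? [this.root] : [];
    while (stack.length) {
      const node = stack.pop()!;
      const d = levenshtein(query, node.word);
      if (d <= maxDist) results.push(node.word);
      for (const [edge, child] of node.children) {
        if (edge >= d - maxDist && edge <= d + maxDist) stack.push(child);
      }
    }
    return results;
  }
}

const tree = new BKTree();
['book', 'books', 'cake', 'boo', 'cape', 'cart'].forEach(w => tree.add(w));
console.log(tree.search('bok', 1));   // e.g. [ 'book', 'boo' ]
```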
You can try http://sisap.org where many modern metric indexes are listed, including BK-trees. You can find code in C to try different alternatives.
Some techniques that involve searching a space and might help you are hill climbing, neural network training, genetic algorithms, and particle swarm optimization.
You will also need to define a distance metric over your metric space. Have you done so? (And, out of curiosity, what is it, if you have?)
