Distributed Text Clustering with Slurm - artificial-intelligence

I want to build a distributed AI-based text classification solution (e.g. based on distributed k-means) , which should work on my cluster based on Slurm. The solution should cluster the input-documents so that similar documents will be grouped together.
However, I am not sure, which frameworks etc. to use - has someone ideas how I could approach this?

Be careful, the word 'classification' is used for describing a supervised task trained with labels. What you're describing is text clustering, which is unsupervised with no labels.
More precisely, what you're describing is topic modelling, a standard task in NLP.
There are various algorithms, the most standard is probably LDA. There are also more recent approaches with DL, for example Bertopic.
About distributing with Slurm, there are apparently options as well, for example with Spark (apparently Spark can be used on top of Slurm.)

Related

Database that allows full text search in O(1)

I have a database of documents where searching quickly for keywords and patterns would be very useful to have.
I know of "Burrows–Wheeler transform"/FM-index. I wonder if there are any programs or database programs based on BWT or similar methods in order to search a corpus in O(1) and hopefully more advantages.
Any ideas?
There is a great book by Witten/Moffat/Bell (1994) Managing Gigabytes; this describes in detail everything you need to know about indexing and retrieval. I think their sourcecode is also available, or has been made available in an information retrieval library.
However, it doesn't include the Burrows-Wheeler transform, as that was only invented in the same year.

Using Neo4j and Lucene in a distributed system

I am looking into Neo4j as a stripped-down document store. A key aspect of document storage is search, and I know Neo4j includes full text search via legacy indices provided by Lucene.
I would be very interested in hearing the limitations of Neo4j search capabilities in a distributed environment. Does it provide a distributed index? In what ways is it inferior to Solr or ElasticSearch? How far can I take it before I must install Solr?
-- EDIT --
We are trying to integrate two distinct search efforts. The first is standard text content search. For instance, using the Enron emails, we want to search for every email that matches "bananas" or "going to the store" and get those document bodies in response. This is where people often turn to Solr.
The second case is more complicated, we have attached a great deal of meta-data to each document. We may have decided that "these" emails were the result of late-night drunk-dialing. Now I want to search for all emails that may have been the result of late-night drunk-dialing. For this kind of meta-data, we believe a graph database is in order.
In a perfect world, I can use one platform to perform both queries. I appreciate that Neo4j (nor OrientDB, Arango, etc) are designed as full text search databases, but I'm trying to understand the limitations thereof.
In terms of volume, we are dealing at a very large scale with batch-style nightly updates. The data is content heavy, with some documents running into hundreds of pages of text, but mostly on the order of a page or two.
I once worked on a health social network where we needed some sort of search and connection search functionalities we first went on neo4j we were very impressed by the cypher query language we could get and express any request however when you throw there billion of nodes you start to pay the price and we started considering another graph db, this time we've made a lot of research, tests and OrientDB was clearly the winner, OrientDB is highly scalable but the thing is that you have to code by yourself, your "search algorithm" if you want to do some advanced things (what is the common point between this two nodes) otherwise you have the SQL like query language (i don't know/remember if he has a name) but you can do some interesting stuff with it
So in conclusion i would definitely go on OrientDB
Neo4j can provide a "distributed index" in the sense that the high availability cluster can make your index available on more than one machine, but I'm pretty sure that's not what you're after. Related to this issue is a different answer I wrote about graph partitioning, and what it takes to distribute a really large number of nodes/relationships across multiple machines. (It's not terribly simple)
Solr and Lucene do two different things (although Solr is built on top of Lucene). I think solr and neo4j are not comparable because they're trying to do completely different things. This site isn't about software recommendations so I can't tell you what you should use other than to say you should read up on solr and neo4j, and figure out which set of functionality you want. As far as I know, this is an exclusive decision as I'm not aware of people integrating solr with neo4j.
Your question is very difficult to answer, I'd recommend expanding on what you are trying to do and what you have tried, you'll probably get better responses.

Ground Truth datasets for Evaluating Open Source NLP tools for Named Entity Recognition

I am working on building a document similarity graph for a collection. I already do all the basic things like tokenization, stemming, stop-word removal, and bag-of-word representation to represent the documents and computing similarity using Jaccard coefficient. I am now trying to extract Named Entities and evaluate if these would be helpful in improving the quality of the document similarity graph. I have been spending much of time on finding ground-truth datasets for my analysis. I have been very disappointed with Message Understanding Conference (MUC) datasets. They are cryptic to understand and requires sufficient data cleaning/massaging before it can be used on a different platform (like Scala)
My questions are here more specifically
Are there tutorials on getting started with MUC datasets that would make it easier for analyzing the results using open source NLP tools like openNLP
there other datasets available?
Tools like OpenNLP and Stanford Core NLP employ approaches that are essentially supervised. Correct?
GATE is a great tool for hand-annotating your own text corpus Correct?
For a new test dataset (that I hand-create) how can I compute the baseline (Vocabulary Transfer) or what kind of metrics can I compute?
First of all, I have a few concerns about using Jaccard coefficient to compute similarity. I'd expect TF.IDF and cosinus similarity to give better results.
Some answers to your questions:
See the CoNLL 203 evaluation campaign: it also provides data, evaluation tools, etc. You ma also have a look at ACE.
Yes
Gate is also a pipeline that automatically annotates text, but as far as I know NER is a rule-based component.
A baseline is most of the time a very simple algorithm (e.g. majority classes) so it is not a baseline to compare corpus, but to compare approaches.

hadoop vs teradata what is the difference

I've touched a Teradata. I've never touched hadoop, but since yesterday, I am doing some research on that. By description of both, they seem quite interchangable, but in some papers it is written that they serve for different purposes. But all I found is vague. I am confused.
Has anybody experience with both of them? What is the serious difference between them?
Simple Example: I want to build ETL which will transform billions rows of raw data and organize them to DWH. Then do some resources expensive analysis on them. Why use TD? Why Hadoop? or why not?
I think this article titled 'MapReduce and Parallel DBMSs: Friends or Foes' does quite a good job describing the situations where each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, where DBMSs excel at executing complex queries quickly.
Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison
I am not an expert in this area, but in the coursera.com course, Introduction to Data Science, there is a lecture titled: Comparing MapReduce and Databases as well as a lecture on Parallel databases within the map reduce section of the course.
Here is a summary from these lectures on the comparison of MapReduce vs. RDBMS (not necessarily parallel RDMBS).
One point to remember is that the comparison is different if you include extensions to Hadoop like PIG, Hive, etc. I will put in () MapReduce extensions that add some of these functionality/properties.
Some functionality/properties that RDBMS have but not native MapReduce:
Declaritive query languages -(Pig, HIVE)
Schemas (Hive, Pig, DyradLINQ, Hadapt)
Logical Data Independence
Indexing (Hbase)
Algebraic Optimization (Pig, Dryad, HIVE)
Caching/Materialized Views
ACID/Transactions
MapReduce (relative to regular RDBMS not necessarily Parallel RDMBS)
High Scalability
Fault-tolerance
“One-person deployment”
I've been asked this question several times, the answer that I usually give is a car analogy (which is pretty silly because I'm not a car person - but it seems to work)
Teradata is the car/dbms for the masses - it is reliable, mature, works well and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
Hadoop is the car/dbms for the enthusiast - it isn't as reliable or mature, it works well so long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.
Put another way, Teradata is the reliable workhorse where you put your mission critical process (operational reporting, enterprise reporting, decision support etc).
Hadoop is the place where you can do alot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch or you've suddenly got a "too many small files" problem.
To loop back into the analogy, if you don't want to be too techy and the manufacturers product (dbms and/or car) works for you out of the box, Teradata is a good option.
On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel air mixture depending on whether you are country or city driving, bolt on a Turbo charger and/or your family complain about how long you spend in the garage on weekends - Hadoop is the place for you.
IMHO, Most, if not all organisations need both.
I hope this helps :-)
To Begin with, Vanilla Apache Hadoop is 100% open source. But if you need commercial support along with consultancy there are companies like Cloudera, MapR, HortonWorks, etc.
Hadoop is backed by a growing community fixing bugs and making improvements on a consistent basis. Hadoop storage model HDFS is based on Google's GFS architecture which is proven to handle large quantities of data. Furthermore Hadoop analysis model Map Reduce is based on Google's Map Reduce Model.
Hadoop is used by Tech Giants like Facebook, Yahoo, Twitter, EBay etc to store and analysis they high volume of data real time as well as passively.
For your question ETL systems read these slides where you will see.
Ok now Why Hadoop?
Open Source
Proven Storage and Analysis model for Large Quantities of data
Minimum Hardware Requirement to setup and run.
Ok now Why TD?
Commercial Support

Algorithms and data structures for 'tag' or key word based query on large data set?

I am trying to implement a storage system to support tagging on data. A very simple application of this system is like questions on Stackoverflow, which are tagged with multiple tags. And a query may consist of multiple tags. This also looks like search on Google with multiple key words.
The data set maintained by this system will be very large, like several or tens of terabytes with billions of entries.
So what data structures and algorithms should I use in this system for maintaining and query data? And the data may be stored across a cluster of machines.
Are there any guide or papers to describe such problem and solutions?
You might want to read the two books below:
Collective Intelligence in Action
Satnam Alag (ISBN: 1933988312)
http://www.manning.com/alag/
"Capter 3. Extracting intelligence from tags" covers:
Three forms of tagging and the use of tags
A working example of how intelligence is extracted from tags
Database architecture for tagging
Developing tag clouds
Programming Collective Intelligence
Toby Segaran (ISBN: 978-0-596-52932-1)
http://shop.oreilly.com/product/9780596529321.do
 "Chapter 4. Searching and Ranking" covers:
Basic concepts of algorithms for search engine index
Design of a click-tracking neural network
Hope it helps.
Your problem is very difficult, but there is a plenty of related papers and books. Amazon Dynamo paper, yahoo PNUTS and this hadoop paper is a good examples.
So, at first, you must decide how your data will be distributed across cluster. Data must be evenly distributed across network, without hot spots. Consistent hashing will be a good solution for this problem. Also, data must be redundant, any entry need to be stored in several places to tolerate faults of individual nodes.
Next, you must decide how writes will occur in your system. Every write must be replicated across nodes that contains updated data entry. You might want to read about CAP theorem, and eventual consistency concept(wikipedia have a good article about both). Also, there is a consistency - latency tradeoff. You can use different mechanisms for writes replication: some kind of gossip protocol or state machine replication.
I don't know what kind of tagging do you mean, is this tags manually assigned to entries or learned from data. Anyway, this is a field of information retrieval(IR). You might use some kind of inverted index to effectively search entries by tags or keywords. Also, you must use some query result ranking algorithm.

Resources