Dataset for task fMRI

I can't find any preprocessed dataset for task fMRI on which I can apply dynamic causal modeling (Friston et al. 2003). Does anyone know of any dataset (other than the three-region attention-to-motion dataset)? Thanks.
Most papers use different datasets, but these datasets are not shared in preprocessed form. Raw versions are available, and some papers describe their preprocessing steps, although for reproducibility it would be helpful to have the preprocessed versions readily available.

Related

Distributed Text Clustering with Slurm

I want to build a distributed, AI-based text classification solution (e.g., based on distributed k-means) that runs on my Slurm-based cluster. The solution should cluster the input documents so that similar documents are grouped together.
However, I am not sure which frameworks etc. to use. Does anyone have ideas on how I could approach this?
Be careful: the word 'classification' describes a supervised task trained with labels. What you're describing is text clustering, which is unsupervised and needs no labels.
More precisely, what you're describing is topic modelling, a standard task in NLP.
There are various algorithms; the most standard is probably LDA. There are also more recent deep learning approaches, for example BERTopic.
As for distributing the work with Slurm, there are apparently options as well, for example with Spark (Spark can apparently run on top of Slurm).
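As an illustration of the topic-modelling route (before worrying about distribution), here is a minimal single-node LDA sketch with scikit-learn; the documents and the choice of two topics are placeholders, not a recommendation for your corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus; replace with your own documents.
docs = [
    "the court ruled on the new tax law",
    "the team won the championship game last night",
    "parliament debated the proposed tax reform",
    "the striker scored twice in the final game",
]

# Bag-of-words counts, then LDA with an (arbitrary) choice of 2 topics.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic distribution

# Show the top words of each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
```

Each document can then be assigned to its highest-probability topic, which gives you the clustering; distributing the work (e.g., one Slurm job per corpus shard or parameter setting) comes on top of this.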

Ground Truth datasets for Evaluating Open Source NLP tools for Named Entity Recognition

I am working on building a document similarity graph for a collection. I already do all the basic things like tokenization, stemming, stop-word removal, and a bag-of-words representation of the documents, and I compute similarity using the Jaccard coefficient. I am now trying to extract named entities and evaluate whether they would help improve the quality of the document similarity graph. I have been spending much of my time on finding ground-truth datasets for my analysis, and I have been quite disappointed with the Message Understanding Conference (MUC) datasets: they are cryptic to understand and require a fair amount of data cleaning/massaging before they can be used on a different platform (like Scala).
More specifically, my questions are:
Are there tutorials on getting started with the MUC datasets that would make it easier to analyze the results using open-source NLP tools like OpenNLP?
Are there other datasets available?
Tools like OpenNLP and Stanford CoreNLP employ approaches that are essentially supervised. Correct?
GATE is a great tool for hand-annotating your own text corpus. Correct?
For a new test dataset (that I hand-create), how can I compute a baseline (vocabulary transfer), or what kind of metrics can I compute?
First of all, I have a few concerns about using the Jaccard coefficient to compute similarity. I'd expect TF-IDF with cosine similarity to give better results; a quick sketch follows below.
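For reference, a minimal scikit-learn sketch of TF-IDF plus cosine similarity (the documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents; in practice these would be your collection.
docs = [
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
    "stock markets fell sharply on monday",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarities between documents (values in [0, 1]).
sim = cosine_similarity(tfidf)
print(sim.round(2))
```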
Some answers to your questions:
See the CoNLL 2003 evaluation campaign: it also provides data, evaluation tools, etc. You may also have a look at ACE.
Yes
GATE is also a pipeline that automatically annotates text, but as far as I know its NER component is rule-based.
A baseline is most of the time a very simple algorithm (e.g., predicting the majority class), so it is not something you use to compare corpora but to compare approaches.
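On the last question about metrics: NER systems are usually evaluated against a hand-annotated test set with entity-level precision, recall, and F1. A minimal sketch, with hypothetical gold and predicted entities:

```python
# Entities as (surface form, label) pairs; both sets here are made up.
gold = {("Barack Obama", "PERSON"), ("Hawaii", "LOCATION"), ("2009", "DATE")}
pred = {("Barack Obama", "PERSON"), ("Hawaii", "ORGANIZATION"), ("2009", "DATE")}

tp = len(gold & pred)  # exact matches on both surface form and label
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```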

Hadoop vs. Teradata: what is the difference?

I've worked with Teradata. I've never touched Hadoop, but since yesterday I have been doing some research on it. Going by their descriptions, the two seem quite interchangeable, yet some papers say they serve different purposes. Everything I have found is vague, and I am confused.
Does anybody have experience with both of them? What is the real difference between them?
Simple example: I want to build an ETL pipeline that transforms billions of rows of raw data and organizes them into a DWH, and then runs some resource-intensive analysis on them. Why use Teradata? Why Hadoop? Or why not?
I think the article titled 'MapReduce and Parallel DBMSs: Friends or Foes?' does quite a good job of describing the situations where each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, whereas DBMSs excel at executing complex queries quickly.
Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison
I am not an expert in this area, but the coursera.com course Introduction to Data Science has a lecture titled 'Comparing MapReduce and Databases' as well as a lecture on parallel databases within the MapReduce section of the course.
Here is a summary from these lectures comparing MapReduce with RDBMSs (not necessarily parallel RDBMSs).
One point to remember is that the comparison changes if you include extensions to Hadoop like Pig, Hive, etc. I will note in parentheses the MapReduce extensions that add some of this functionality.
Some functionality/properties that RDBMSs have but native MapReduce does not:
Declarative Query Languages (Pig, Hive)
Schemas (Hive, Pig, DryadLINQ, Hadapt)
Logical Data Independence
Indexing (HBase)
Algebraic Optimization (Pig, Dryad, Hive)
Caching/Materialized Views
ACID/Transactions
Some functionality/properties that MapReduce has (relative to a regular RDBMS, not necessarily a parallel RDBMS):
High Scalability
Fault-tolerance
“One-person deployment”
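To make the contrast concrete, here is a toy sketch of the bare MapReduce programming model (word count), simulated locally in Python rather than on a real Hadoop cluster; the point is that even a simple aggregation must be written as map and reduce functions, which is exactly what declarative extensions like Pig and Hive hide from you:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum the partial counts for one key.
    yield word, sum(counts)

def run(lines):
    # "Shuffle" phase: group mapper output by key, then apply the reducer.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapper(line) for line in lines):
        grouped[key].append(value)
    return dict(chain.from_iterable(reducer(k, v) for k, v in grouped.items()))

print(run(["hadoop stores raw data", "hive queries the data"]))
```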
I've been asked this question several times; the answer I usually give is a car analogy (which is pretty silly because I'm not a car person, but it seems to work).
Teradata is the car/DBMS for the masses: it is reliable, mature, works well and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
Hadoop is the car/DBMS for the enthusiast: it isn't as reliable or mature, but it works well as long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.
Put another way, Teradata is the reliable workhorse where you put your mission-critical processes (operational reporting, enterprise reporting, decision support, etc.).
Hadoop is the place where you can do a lot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch or you've suddenly got a "too many small files" problem.
To loop back into the analogy, if you don't want to be too techy and the manufacturer's product (DBMS and/or car) works for you out of the box, Teradata is a good option.
On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel-air mixture depending on whether you are country or city driving, bolt on a turbocharger, and/or your family complains about how long you spend in the garage on weekends, then Hadoop is the place for you.
IMHO, most, if not all, organisations need both.
I hope this helps :-)
To begin with, vanilla Apache Hadoop is 100% open source. But if you need commercial support along with consultancy, there are companies like Cloudera, MapR, Hortonworks, etc.
Hadoop is backed by a growing community that fixes bugs and makes improvements on a consistent basis. Hadoop's storage layer, HDFS, is based on Google's GFS architecture, which has been proven to handle large quantities of data. Furthermore, Hadoop's analysis model, MapReduce, is based on Google's MapReduce model.
Hadoop is used by tech giants like Facebook, Yahoo, Twitter, eBay, etc. to store and analyse their high volumes of data, both in real time and in batch.
For your question about ETL systems, have a look at these slides.
OK, now why Hadoop?
Open source
A proven storage and analysis model for large quantities of data
Minimal hardware requirements to set up and run
OK, now why TD?
Commercial support

Association Rule Mining on a FOAF dataset of social networks

I am working on a project called "association rule discovery from social network data: Introducing Data Mining to the Semantic Web". Can anyone suggest a good source for an algorithm (and its code; I have heard it can be implemented using Perl and also R packages) to find association rules in a social network database?
A snapshot of the database can be downloaded from the following link: https://docs.google.com/uc?id=0B0mXGRdRowo1MDZlY2Q0NDYtYjlhMi00MmNjLWFiMWEtOGQ0MjA3NjUyZTE5&export=download&hl=en_US
The dataset is available at the following link: http://ebiquity.umbc.edu/get/a/resource/82.zip
I have searched a lot for this project but unfortunately haven't found anything useful yet. The following link is somewhat related:
Criminal data: http://www.computer.org/portal/web/csdl/doi/10.1109/CSE.2009.435
Your help will be highly appreciated.
Thank you.
Well, the most widely used implementations of the original association rules algorithm (originally developed at IBM's Almaden Research Center) are Apriori and Eclat, in particular the C implementations by Christian Borgelt.
(A brief summary for anyone not familiar with association rules (aka "frequent itemsets" or "market basket analysis"): the prototypical application is analyzing consumer transactions, e.g., supermarket data: among shoppers who buy Polish sausage, what percentage also purchase black bread?)
I would recommend the statistical platform R. It is free and open source, and its package repository contains (at least) four libraries devoted solely to association rules, all with excellent documentation: three of the four packages include a manual and a separate vignette (an informal prose document with code examples). Both the manuals and the vignettes contain numerous examples in R code.
I have used three of the four packages below and I can recommend those three highly. Among them are bindings for Eclat and Apriori. These libraries are distributed as R 'packages', available on CRAN, R's primary package repository. Basic installation and setup of R is trivial: there are binaries for Mac, Linux, and Windows. Likewise, package installation/integration is as simple as you would expect from an integrated platform (though not every one of the four packages listed below has binaries for every OS).
So on CRAN you will find these packages, all devoted solely to association rules:
arules
arulesNBMiner
arulesSequences
arulesViz
This set of four R packages comprises R bindings for four different association rules implementations, as well as a visualization library.
The first package, arules, includes R bindings for Eclat and Apriori. The second, arulesNBMiner, provides bindings for Michael Hahsler's NB-frequent itemsets algorithm. The third, arulesSequences, provides bindings for Mohammed Zaki's cSPADE.
The last of these, arulesViz, is particularly useful because it is a visualization library for plotting the output of any of the previous three packages. For your social network study, I suspect you will find the graph visualization, i.e., explicit visualization of the nodes (users in the data set) and the edges (connections between them), especially useful.
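If you end up in Python rather than R, the same Apriori workflow can be sketched with the mlxtend library (a different implementation from the packages above; the transactions here are made up):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy "market basket" transactions; in your case these would be, e.g.,
# the sets of interests or connections attached to each FOAF profile.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["milk", "butter"],
    ["bread", "milk", "butter"],
]

# One-hot encode the transactions into a boolean item matrix.
encoder = TransactionEncoder()
onehot = encoder.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=encoder.columns_)

# Mine frequent itemsets, then derive association rules from them.
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```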
This is a bit broader than http://en.wikipedia.org/wiki/Association_rule_learning but hopefully useful.
Some earlier FOAF work that might be interesting (SVD/PCA etc):
http://stderr.org/~elw/foaf/
http://www.scribd.com/doc/353326/The-Social-Semantics-of-LiveJournal-FOAF-Structure-and-Change-from-2004-to-2005
http://datamining.sztaki.hu/files/snakdd.pdf
Also Ch.4 of http://www.amazon.com/Understanding-Complex-Datasets-Decompositions-Knowledge/dp/1584888326 is devoted to the application of matrix decomposition techniques against graph data structures; strongly recommended.
Finally, Apache Mahout is the natural choice for large scale data mining, machine learning etc., https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
If you want some Java code, you can check my website for the SPMF software. It provides source code for more than 45 algorithms for frequent itemset mining, association rule mining, sequential pattern mining, etc.
Moreover, it does not only offer the most popular algorithms. It also includes many variations, such as mining rare itemsets, high-utility itemsets, uncertain itemsets, non-redundant association rules, closed association rules, indirect association rules, top-k association rules, and much more.

Datasets for Apache Mahout

I am looking for datasets that can be used to implement the recommendation system use case of Apache Mahout. I only know of the MovieLens datasets from the GroupLens research group.
Does anyone know of other datasets that can be used for a recommendation system implementation? I am particularly interested in item-based datasets, though other datasets are most welcome.
This is Sebastian from Mahout.
There is a dataset from a Czech dating website available that might be of interest to you: http://www.occamslab.com/petricek/data/
By the way, the term 'item-based' refers to a particular collaborative filtering approach, not to the dataset itself; datasets are usually in the common form of user-item-rating triples that most collaborative filtering approaches work with.
We would love to hear about your experimentation results and experiences (if you want to share them) on our user mailing list at user#mahout.apache.org.
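To make that distinction concrete, here is a minimal Python sketch (with made-up ratings) that starts from user-item-rating triples and computes the item-item cosine similarities an item-based approach relies on:

```python
import numpy as np
import pandas as pd

# Hypothetical user-item-rating triples, the common input format mentioned above.
ratings = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u3", "u3"],
    "item":   ["i1", "i2", "i1", "i3", "i2", "i3"],
    "rating": [5, 3, 4, 2, 4, 5],
})

# Pivot to a user x item matrix; missing ratings become 0 in this toy example.
matrix = ratings.pivot(index="user", columns="item", values="rating").fillna(0.0)

# Item-based CF: cosine similarity between the item columns.
item_vectors = matrix.to_numpy().T                      # shape: items x users
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
similarity = (item_vectors @ item_vectors.T) / (norms @ norms.T)

print(pd.DataFrame(similarity, index=matrix.columns, columns=matrix.columns).round(2))
```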
While searching for datasets, I found a few sites that list publicly available datasets which can be used for data mining. Some of these can be used with Mahout too.
Bixo Labs
UCI Datasets
KDnuggets
You can look at the iPinYou RTB bidding dataset:
Quora : http://qr.ae/OrqgM
http://contest.ipinyou.com/data-release.html
