Large classification document corpus - dataset

Can anyone point me to a large corpus that I can use for classification?
By large I don't mean Reuters or 20 Newsgroups; I'm talking about a corpus on the order of gigabytes, not 20 MB or so.
So far I have only been able to find Reuters and 20 Newsgroups, which are far too small for what I need.

The most popular datasets for text-classification evaluation are:
Reuters Dataset
20 Newsgroup Dataset
However, the datasets above do not meet the 'large' requirement. The datasets below might meet your criteria:
Common Crawl. You could build a large corpus by extracting articles that have specific keywords in their meta tags and apply it to document classification (see the sketch after this answer).
Enron Email Dataset. You could do a variety of different classification tasks here.
Topic Annotated Enron Dataset. Not free, but already labelled, and meets your large-corpus requirement.
You can browse other publicly available datasets here.
Other than the above, you might have to develop your own corpus. I will be releasing a news corpus builder later this weekend that will help you develop custom corpora based on topics of your choice.
Update:
I created the custom corpus builder module I mentioned above but forgot to link it: News Corpus Builder
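To illustrate the Common Crawl suggestion above, here is a minimal sketch that labels pages by matching keywords in their meta tags; the topic keyword lists, seed URLs, and how you obtain the URL list (e.g. from Common Crawl WARC indexes) are assumptions for illustration only, not part of the original answer.

```python
# Minimal sketch: assign a topic label to a page by matching its <meta> keywords.
# TOPIC_KEYWORDS and the URL source are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

TOPIC_KEYWORDS = {
    "sports": {"football", "tennis", "nba"},
    "finance": {"stocks", "markets", "earnings"},
}

def classify_page(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "keywords"})
    if not meta or not meta.get("content"):
        return None, None
    page_keywords = {k.strip().lower() for k in meta["content"].split(",")}
    body_text = soup.get_text(separator=" ", strip=True)
    for topic, words in TOPIC_KEYWORDS.items():
        if page_keywords & words:   # any keyword overlap -> assign this topic
            return topic, body_text
    return None, body_text

# Usage: iterate over your URL list and write out (topic, text) pairs
# as the labelled documents of your corpus.
```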

Huge Reddit archive spanning 10/2007 to 5/2015

Related

What is the right database technology for this simple outlined BI tool use case?

Reaching out to the community to pressure-test our internal thinking.
We are building a simplified business intelligence platform that will aggregate metrics (e.g. traffic, backlinks) and text lists (e.g. search keywords, technologies used) from several data providers.
The data will be somewhat loosely structured and may change over time, with vendors potentially changing their response formats.
Long-term data volume may be around 100,000 rows x 25 input vectors.
Data would be updated and read continuously, but not at massive concurrent volume.
We'd expect to need some ETL transformations on the data gathered from partners on the way to the UI (e.g. showing trend information over the past five captured data points).
We'd want to archive every single data snapshot (i.e. version it) rather than just storing the most recent data point.
The persistence technology should be readily available through AWS.
Our assumption is that our requirements lend themselves best to DynamoDB (vs. Amazon Neptune, Redshift, or Aurora).
Is that fair to assume? Are there any other questions I can answer or information I can provide to elicit input from this community?
Because of your requirement to have a schema-less structure, and to version each item, DynamoDB is a great choice. You will likely want to build the table as a composite Partition/Sort key structure, with the Sort key being the Version, and there are several techniques you can use to help you locate the 'latest' version etc. This is a very common pattern, and with DDB Autoscaling you can ensure that you only provision the amount of capacity that you actually need.
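A minimal boto3 sketch of that composite Partition/Sort key versioning pattern; the table name, key names, and the zero-padded string version scheme are assumptions for illustration, not part of the original answer.

```python
# Sketch: version each item with a composite Partition/Sort key in DynamoDB.
# Assumes a table 'Metrics' with partition key 'EntityId' and sort key 'Version'.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Metrics")

def put_snapshot(entity_id, version, payload):
    # Store every snapshot; the sort key keeps all versions of the same entity.
    table.put_item(Item={"EntityId": entity_id, "Version": version, **payload})

def get_latest(entity_id):
    # Query the sort key in descending order and take the first item = latest version.
    resp = table.query(
        KeyConditionExpression=Key("EntityId").eq(entity_id),
        ScanIndexForward=False,
        Limit=1,
    )
    return resp["Items"][0] if resp["Items"] else None

# Zero-padded versions ('v0001', 'v0002', ...) sort correctly as strings.
put_snapshot("site#example.com", "v0001", {"traffic": 1200, "backlinks": 45})
print(get_latest("site#example.com"))
```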

How to process natural language query into solr understandable query

What I am doing and what I have done so far:
I'm developing a question-and-answer system using Solr. I took product reviews as my dataset (it contains product IDs and reviews from different users) in JSON format. I have indexed the dataset and successfully got responses for the indexed data.
Requirements:
In my Q/A system I will provide a query in natural language, for example "why should I buy X (product name)", and my Q/A system should be capable of recognizing phrases in reviews like "it's easy to use, a flexible product" and frame its answer based on those phrases.
I would like to know the following:
How can I translate a natural language query into a Solr-executable query?
How can I prepare my answer to the query?
What kind of NLP models should I use?
How should I train my Q/A system?
Any other information that can help me achieve these requirements is welcome.
You are nowhere near Solr yet. You have to go back and look at the actual NLP (Natural Language Processing) system. If it uses Solr (or OpenNLP, which integrates with Solr), great. If not, you have to build this bridge yourself; it does not just come with Solr, as this is still at the cutting edge of research.
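As a rough illustration of one such bridge, here is a minimal sketch that strips stopwords from the natural-language question and hands the remaining keywords to Solr's edismax parser via pysolr. The core name, field names, stopword list, and example question are assumptions; a real system would use proper NLP (POS tagging, entity recognition, intent classification) instead of a stopword list.

```python
# Sketch: turn a natural-language question into a keyword query for Solr.
# Assumes a Solr core 'reviews' with fields 'product_id' and 'review_text'.
import pysolr

STOPWORDS = {"why", "should", "i", "buy", "the", "a", "an", "is", "it", "this", "to", "of"}

def to_solr_query(question):
    # Keep only content-bearing tokens from the question.
    tokens = [t.strip("?.,!").lower() for t in question.split()]
    keywords = [t for t in tokens if t and t not in STOPWORDS]
    return " ".join(keywords)

solr = pysolr.Solr("http://localhost:8983/solr/reviews", timeout=10)

question = "Why should I buy this phone? Is it easy to use?"
results = solr.search(
    to_solr_query(question),                      # e.g. "phone easy use"
    **{"defType": "edismax", "qf": "review_text", "rows": 5},
)
for doc in results:
    print(doc.get("product_id"), doc.get("review_text"))
```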

How to cluster images from a large dataset into groups

I want to cluster an image dataset into several groups using K-means, N-cut, or another algorithm, but I don't know how to process the images in the dataset first. The groups should each have their own distinctive features. Does anyone have any suggestions?
My suggestion is that you go ahead and try a number of features.
Which feature works best for you depends very much on your use case.
Grouping photos by mood, grouping faces by user, and grouping CAD drawings by the type of gear they show require completely different feature-extraction approaches. So you will have to kiss a few frogs to find your prince.
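As a concrete starting point, here is a minimal sketch that clusters images with scikit-learn's KMeans on a simple colour-histogram feature; the image folder, histogram feature, and number of clusters are assumptions to illustrate the pipeline, not a recommendation for any particular use case.

```python
# Sketch: cluster images with K-means on a simple colour-histogram feature.
import glob
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def colour_histogram(path, bins=8):
    # 8 bins per RGB channel -> a 512-dimensional normalised histogram.
    img = np.asarray(Image.open(path).convert("RGB").resize((128, 128)))
    hist, _ = np.histogramdd(
        img.reshape(-1, 3),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.flatten()
    return hist / hist.sum()

paths = sorted(glob.glob("images/*.jpg"))          # assumed image folder
features = np.array([colour_histogram(p) for p in paths])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)
for path, label in zip(paths, kmeans.labels_):
    print(label, path)
```

When colour alone is not discriminative enough, swapping the histogram for features from a pretrained CNN is the usual next step.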
You could use a mosaic dataset.
A mosaic dataset allows you to store, manage, view, and query small to vast collections of raster and image data. It is a data model within the geodatabase used to manage a collection of raster datasets (images) stored as a catalog and viewed as a mosaicked image. Mosaic datasets have advanced raster querying capabilities and processing functions, and can also be used as a source for serving image services. Find more here.
I would also refer you to this paper:
Title: CREATING AN IMAGE DATASET TO MEET YOUR CLASSIFICATION NEEDS: A PROOF-OF-CONCEPT STUDY
By:
James D. Hurd, Research Associate
Daniel L. Civco, Director and Professor

Web data extraction and data mining; Scraping vs Injection and how to get data.. like yesterday

I feel like I should almost give a friggin' synopsis to these lengthy questions.
I apologize if all of these questions have been answered in a previous question/answer post, but I have been unable to locate any that specifically addresses all of the following queries.
This question involves data extraction from the web (i.e. web scraping, data mining, etc.). I have spent almost a year researching these fields and how they can be applied to a certain industry. I have also familiarized myself with PHP and MySQL/phpMyAdmin.
In a nutshell, I am looking for a way to extract information from a site (probably several gigs' worth) as quickly and efficiently as possible. I have tried web scraping programs like Scrapy and WebHarvey. I have also experimented with programs like HTTrack. All have their strengths and weaknesses. I have found that WebHarvey works pretty well, yet it has its limitations when scraping images that are stored in gallery widgets. I also find that many of the sites I am extracting from use other methods that make mining the data a pain. It would take months to extract the data using WebHarvey, which I can't complain about, given that I'd be extracting millions of rows' worth of data exported in CSV format into Excel. But again, images and certain AJAX widgets throw the program off when trying to extract image files.
So my questions are as follows:
Are there any quicker ways to extract said data?
Is there any way to get around WebHarvey's image limitations (i.e. only being able to extract one image within a gallery widget, and not being able to follow sub-page links on sites that embed their content in funny ways and try to get cute with their coding)?
Are there any ways to bypass site search form parameters that limit the number of search results (i.e. obtaining all business listings within an entire state instead of being limited to one county per search by the form's restrictions)?
Also, this is public information and therefore cannot be copyrighted; anybody can take it :) (case in point: Feist Publications v. Rural Telephone Service). Extracting information is extracting information. It's legal to extract as long as we are talking about facts/public information.
So with that said, wouldn't the most efficient method (grey area here) of extracting this "public" information (assuming vulnerabilities existed) be through the use of SQL injection?... If one was so inclined? :)
As a side question, just how effective is Tor at obscuring one's IP address? Lol
Any help, feedback, suggestions or criticism would be greatly appreciated. I am by no means an expert in any of the above mentioned fields. I am just a motivated individual with a growing interest in programming and automation who has a lot of crazy ideas. Thank you.
You may be better off writing your own Linux command-line scraping program using either a headless browser library like PhantomJS (JavaScript), or a test framework like Selenium WebDriver (Java).
Once you have your scraping program completed, you can then scale it up by installing it on a cloud server (e.g. Amazon EC2, Linode, Google Compute Engine, or Microsoft Azure) and duplicating the server image to as many instances as are required.
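The answer names PhantomJS (JavaScript) and Selenium WebDriver (Java); as a rough Python equivalent, here is a minimal sketch using Selenium with headless Chrome. The target URL and CSS selectors are placeholders, and AJAX-heavy pages will usually also need explicit waits.

```python
# Sketch: headless-browser scraping with Selenium so JavaScript-rendered
# galleries and widgets are executed before extraction.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")   # run without a visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/listings")         # placeholder URL
    # WebDriverWait-based waits are usually needed on AJAX-heavy pages.
    for img in driver.find_elements(By.CSS_SELECTOR, ".gallery img"):
        print(img.get_attribute("src"))                 # collect image URLs
    for row in driver.find_elements(By.CSS_SELECTOR, ".listing"):
        print(row.text)                                 # collect listing text
finally:
    driver.quit()
```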

Big Query vs Text Search API

I wonder whether BigQuery is going to replace/compete with the Text Search API? It may be a stupid question, but the Text Search API has been in beta for a few months and has very strict API call limits, while BigQuery is already available and looks very promising. Any hints on what to choose for searching over constantly incoming error logs?
Google BigQuery and the App Engine Search API fulfill the needs of different types of applications.
BigQuery is excellent for aggregate queries (think: full table scans) over fixed schema data structures in very very large tables. The aim is speed and flexibility. BigQuery lacks the concept of indexes (by design). While it can be used for "needle in a haystack" type searches, it really shines over large, structured datasets with a fixed schema. In terms of document type searches, BigQuery records have a fixed maximum size, and so are not ideal for document search engines. So, I would use BigQuery for queries such as: In my 200Gb log files, what are the 10 most common referral domains, and how often did I see them?
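For the kind of aggregate question described above, a query through the Python BigQuery client might look like the following minimal sketch; the project, dataset, table, and column names are assumptions for illustration.

```python
# Sketch: an aggregate "top referral domains" query over a large log table.
from google.cloud import bigquery

client = bigquery.Client()   # uses default project credentials

sql = """
    SELECT referral_domain, COUNT(*) AS hits
    FROM `my_project.logs.requests`      -- assumed dataset/table
    GROUP BY referral_domain
    ORDER BY hits DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.referral_domain, row.hits)
```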
The Search API provides sorted search results over various types of document data (text, HTML, geopoint etc). Search API is really great for queries such as finding particular occurrences of documents that contain a particular string. In general, the Search API is great for document retrieval based on a query input.
