Google Summer of Code: web classification dataset

I heard that Google hosted (or will host) a web classification competition, providing a large dataset (170k+ documents) of websites classified into multiple categories (sports, computers, science, etc.). I looked around the Summer of Code websites for 2009 through 2011 but didn't find anything. Does anybody know where I can get that dataset?

I think I found it (although I'm not sure the data was provided by Google): the ECML/PKDD 2010 Discovery Challenge Data Set contains 22 training labels (i.e. labels about the content), URLs and hyperlinks, content-based and link-based web spam features, term frequencies, and natural language processing features.

Related

State-of-the-art anti-virus protection in web applications

I plan to add anti-virus protection to the web application we are building. My concern is that even the limited set of files users upload (PDF files, images, or even unknown binaries) may contain viruses.
Concerns:
Images shared with other users (exposed on web pages) may contain viruses.
The PDF files that users share with each other may contain viruses.
The API I am building for this web application handles file uploads and also acts as the file server.
Are there any state-of-the-art approaches to minimizing users' exposure to malware, either in the API or on the client side (browser)? More specifically, I'm interested in solutions that scan files in the API itself (backend). The files may be stored in a database or on the file system.
I searched GitHub for open-source tools and packages and ran several Google searches for terms like "open source anti-virus API" and "open-source malware HTTP API", but could not find anything. Broader search terms returned a huge number of unrelated results.
A related but outdated question investigates a similar problem; I'm looking for a solution that would integrate well into a microservice architecture (e.g. on Kubernetes), and I think a canonical answer from an expert would be useful.
There are definitely solutions that can help you and integrate into your web application via an API. Here are a few that I am aware of:
SophosLabs Intelix
Intelix is a threat intelligence platform that provides access via APIs through the AWS Marketplace. There are three parts to the service: lookups, static analysis, and dynamic analysis. Each gives a progressively more detailed analysis of the file, and combining the three gives your web application good protection.
VirusTotal
VirusTotal is a community service that provides aggregated information on what various anti-malware vendors say about your file. While VT is a great service, one thing to watch is that it is built around its community: files you upload are shared with other members.
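As a rough illustration of wiring this into a backend, here is a minimal Python sketch that submits an uploaded file to VirusTotal's v3 REST API. The `requests` library and the `VT_API_KEY` environment variable are assumptions of mine; check VT's docs for quotas and response details.

```python
import os
import requests

VT_FILES_URL = "https://www.virustotal.com/api/v3/files"
API_KEY = os.environ["VT_API_KEY"]  # assumption: your key, supplied via the environment

def submit_to_virustotal(path: str) -> str:
    """Upload a file for scanning and return the analysis id to poll later."""
    with open(path, "rb") as f:
        resp = requests.post(
            VT_FILES_URL,
            headers={"x-apikey": API_KEY},
            files={"file": (os.path.basename(path), f)},
        )
    resp.raise_for_status()
    # Poll https://www.virustotal.com/api/v3/analyses/<id> for the verdict.
    return resp.json()["data"]["id"]
```

Remember the caveat above: anything you submit this way is shared with the VT community, so don't send files containing user data you must keep private.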
ClamAV
Not one that I have personal experience of, but ClamAV lets you spin up a scanning daemon (clamd) and then query it over a simple socket API; tutorials and documentation are available on the ClamAV site.
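To make that concrete, here is a minimal sketch of scanning an upload from a Python backend by streaming it to a running clamd instance over TCP using clamd's INSTREAM command. The host, port, and chunk size are placeholder assumptions; a production setup would add timeouts and error handling.

```python
import socket
import struct

CLAMD_HOST, CLAMD_PORT = "localhost", 3310  # assumed clamd address/port
CHUNK = 4096

def scan_bytes(data: bytes) -> str:
    """Stream a buffer to clamd with the INSTREAM command; returns clamd's verdict line."""
    with socket.create_connection((CLAMD_HOST, CLAMD_PORT)) as sock:
        sock.sendall(b"zINSTREAM\0")  # null-terminated command form
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            # Each chunk is prefixed with its length as a 4-byte big-endian integer.
            sock.sendall(struct.pack("!I", len(chunk)) + chunk)
        sock.sendall(struct.pack("!I", 0))  # zero-length chunk terminates the stream
        return sock.recv(4096).decode().rstrip("\x00\n")

# e.g. scan_bytes(upload.read()) -> "stream: OK" or "stream: Eicar-Test-Signature FOUND"
```

Because clamd is just a daemon you run yourself, this fits a microservice setup well: package it as its own container and have the upload API call it before persisting files.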
Others
If you tweak your Google search and look for sandboxes, most offer an API for a fee. A couple that come to mind are Joe Sandbox and Falcon Sandbox, which powers Hybrid Analysis.
As always, be careful of any cloud service that offers you scanning for free. Most of the free tools will share the reports and/or files within their community.

Document Conversion Watson service not working?

I've been trying to use the IBM Watson Document Conversion service with the demo PDF, but it's not splitting the document into smaller pieces. All it's doing is creating one answer unit that's really long:
"text": "Watson is an artificially intelligent computer system capable of answering questions posed in natural language,[2] developed in IBM's DeepQA project by a research team led by principal investigator David Ferrucci. Watson was named after IBM's first CEO and industrialist Thomas J. Watson.[3][4] The computer system was specifically developed to answer questions on the quiz show Jeopardy![5] In 2011, Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings.[3][6] Watson received the first place prize of $1 million.[7] Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage[8] including the full text of Wikipedia,[9] but was not connected to the Internet during the game.[10][11] For each clue, Watson's three most probable responses were displayed on the television screen. Watson consistently outperformed its human opponents on the game's signaling device, but had trouble responding to a few categories, notably those having short clues containing only a few words. In February 2013, IBM announced that Watson software system's first commercial application would be for utilization management decisions in lung cancer treatment at Memorial Sloan- Kettering Cancer Center in conjunction with health insurance company WellPoint.[12] IBM Watson's former business chief Manoj Saxena says that 90% of nurses in the field who use Watson now follow its guidance.[13]"
Thanks in advance!
Unfortunately, that demo PDF is not the best document to use: currently, Answer Units are split based on heading tags (h1 - h6), and that PDF doesn't contain any headings. =(
If you set the conversion_target to NORMALIZED_HTML, you'll be able to see the converted PDF before it is split up into Answer Units. It will contain paragraphs but no headings.
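For reference, here is a minimal sketch of requesting NORMALIZED_HTML with Python's `requests`. The endpoint URL, credentials style, and version date below are assumptions based on the beta-era Document Conversion docs, so double-check them against the current service documentation.

```python
import json
import requests

# Assumed beta-era endpoint and version date -- check the service docs for current values.
URL = "https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document"
AUTH = ("username", "password")  # placeholder service credentials

config = {"conversion_target": "NORMALIZED_HTML"}  # instead of ANSWER_UNITS

with open("demo.pdf", "rb") as pdf:
    resp = requests.post(
        URL,
        params={"version": "2015-12-15"},
        auth=AUTH,
        files={
            "config": ("config.json", json.dumps(config), "application/json"),
            "file": ("demo.pdf", pdf, "application/pdf"),
        },
    )
print(resp.text)  # the converted HTML, before any Answer Unit splitting
```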
In the future, we expect to also allow splitting Answer Units by paragraph, but that hasn't been released yet.
UPDATE:
We updated the PDF on the demo site with one that's a much better example.

Storing 100k map markers in App Engine

I'm designing yet another "Find Objects near my location" web site and mobile app.
My requirements are:
Store up to 100k objects;
Query for objects close to a point (my location, a city, etc.) and by other search criteria (like object type);
Display results on Google Maps with smooth performance;
Let the user filter objects by object type.
I'm thinking about using Google App Engine for this project.
Could you recommend the best data storage option for this?
A couple of words about a dynamic data loading strategy would also be appreciated.
I feel a bit overwhelmed by the options at the moment and am looking for hints on where I should continue my research.
Thanks a lot!
I'm going to assume that you are using the Datastore. I'm not familiar with Google Cloud SQL (which I believe aims to offer MySQL-like features in the cloud), so I can't speak to whether it can do geospatial queries.
I've been looking into the whole "get locations in proximity of a location" problem for a while now. I have some good and bad news for you, unfortunately.
The best way to do the proximity search in the Google environment is via the Search Service (https://developers.google.com/appengine/docs/python/search/; there is an equivalent Java API). The reason is that it supports a geopoint field and lets you query by distance from a location.
OK, cool, so there is support, right? However, "A query is complex if its query string includes the name of a geopoint field or at least one OR or NOT boolean operator". The free quota for complex search queries is 100/day; beyond that, they cost 60 cents per 10,000 queries. Depending on your application, this may be an issue.
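To make the geopoint support concrete, here is a minimal sketch using the App Engine Python Search API; the index name, field names, and default radius are placeholders of mine.

```python
from google.appengine.api import search

INDEX = search.Index(name="places")  # hypothetical index name

def add_place(doc_id, name, lat, lng):
    """Index an object with a geopoint field so it can be found by distance."""
    INDEX.put(search.Document(
        doc_id=doc_id,
        fields=[
            search.TextField(name="name", value=name),
            search.GeoField(name="location", value=search.GeoPoint(lat, lng)),
        ],
    ))

def places_near(lat, lng, radius_m=10000):
    """Distance queries like this count as 'complex' for quota/billing purposes."""
    query = "distance(location, geopoint(%f, %f)) < %d" % (lat, lng, radius_m)
    return INDEX.search(query)
```

You can combine the distance predicate with other field predicates (e.g. the object type from your requirements) in the same query string, but note that each such query still counts against the complex-query quota described above.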
I'm not too familiar with the Google Maps API, but you might be able to pull off something like this: https://developers.google.com/maps/articles/phpsqlsearch_v3
My current project/problem involves moving locations, not "static" ones (stores, landmarks, etc.). I've decided to go with Amazon's DynamoDB, and they have a library that supports geospatial indexing: http://aws.amazon.com/about-aws/whats-new/2013/09/05/announcing-amazon-dynamodb-geospatial-indexing/

Web data extraction and data mining; scraping vs. injection, and how to get data... like yesterday

I feel like I should almost give a friggin' synopsis to this/these lengthy question(s)...
I apologize if all of these questions have been answered specifically in a previous question/answer post, but I have been unable to locate any that specifically addresses all of the following queries.
This question involves data extraction from the web (i.e. web scraping, data mining, etc.). I have spent almost a year researching these fields and how they can be applied to a certain industry. I have also familiarized myself with PHP and MySQL/phpMyAdmin.
In a nutshell, I am looking for a way to extract information from a site (probably several gigs' worth) as fast and efficiently as possible. I have tried web scraping programs like Scrapy and WebHarvy, and I have also experimented with programs like HTTrack. All have their strengths and weaknesses. I have found that WebHarvy works pretty well, yet it has its limitations when scraping images stored in gallery widgets. I also find that many of the sites I am extracting from use other methods to make mining data a pain. It would take months to extract the data using WebHarvy, which I can't complain about, given that I'd be extracting millions of rows of data exported in CSV format into Excel. But again, images and certain AJAX widgets throw the program off when trying to extract image files.
So my questions are as follows:
Are there any quicker ways to extract said data?
Is there any way to get around the WebHarvy image limitations (i.e. only being able to extract one image within a gallery widget, and not being able to follow sub-page links on sites that embed their stuff in funny ways and try to get cute with their coding)?
Are there any ways to bypass site search form parameters that limit the number of search results (i.e. obtaining all business listings within an entire state instead of being limited to one county per the search form's restrictions)?
Also, this is public information, so it cannot be copyrighted; anybody can take it :) (case in point: Feist Publications v. Rural Telephone Service). Extracting information is extracting information. It's legal to extract as long as we are talking about facts/public information.
So with that said, wouldn't the most efficient method (grey area here) of extracting this "public" information (assuming vulnerabilities existed) be through the use of SQL injection?... If one was so inclined? :)
As a side question, just how effective is Tor at obscuring one's IP address? Lol
Any help, feedback, suggestions, or criticism would be greatly appreciated. I am by no means an expert in any of the above-mentioned fields, just a motivated individual with a growing interest in programming and automation who has a lot of crazy ideas. Thank you.
You may be better off writing your own Linux command-line scraping program using either a headless browser library like PhantomJS (JavaScript), or a test framework like Selenium WebDriver (Java).
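As a rough sketch of the Selenium WebDriver route (shown here with the Python bindings and headless Chrome rather than Java; the URL and CSS selector are placeholders), something like this handles JavaScript-heavy gallery widgets because a real browser executes the page scripts before you read the DOM:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless")  # run without a visible browser window

driver = webdriver.Chrome(options=opts)
try:
    driver.get("https://example.com/gallery")  # placeholder URL
    # Images injected by gallery/AJAX widgets end up in the rendered DOM,
    # so they can be collected even when simpler scrapers miss them:
    urls = [img.get_attribute("src")
            for img in driver.find_elements(By.CSS_SELECTOR, ".gallery img")]
    print("\n".join(u for u in urls if u))
finally:
    driver.quit()
```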
Once you have your scraping program completed, you can scale it up by installing it on a cloud server (e.g. Amazon EC2, Linode, Google Compute Engine, or Microsoft Azure) and duplicating the server image across as many instances as required.

What is the difference between CouchDB and Lotus Notes?

I was looking into the possibility of using CouchDB. I heard that it was similar to Lotus Notes which everyone loves to hate. Is this true?
Development of Lotus Notes began over 20 years ago, with version 1 released in 1989. It was developed by Ray Ozzie, currently Chief Software Architect for Microsoft.
Lotus Notes (the client) and Domino (the server) have been around for a long time and are mature, well-featured products. The platform has:
A full client-server stack with rapid application design and deployment of document-oriented databases.
A full public key infrastructure for security and encryption.
A robust replication model and active-active clustering across heterogeneous platforms (someone once demonstrated a Domino cluster consisting of an Xbox and a huge AIX server).
A built-in native directory for managing users that can also be accessed over LDAP.
A built-in native mail system that can scale to manage millions of users with multi-GB mail files, with live server access or local replication for offline access. It can interface with standard internet mail through SMTP and also has POP and IMAP access built in. The mail infrastructure is a core feature available to all applications built on Notes Domino (any document in a database can be mailed to any other database with a simple doc.send() command).
A built-in HTTP stack that allows server-hosted databases to be accessed over the web.
A host of integration options for accessing, transferring, and interoperating with RDBMS and ERP systems, with a closely coupled DB2 integration available that allows Notes databases to be backed by a relational store where desired.
Backwards compatibility has always been a strong feature of Notes Domino and it is not uncommon to find databases that were developed for version 3 running flawlessly in the most up to date versions. IBM puts a huge amount of effort into this and it has a large bearing on how the product currently operates.
-
CouchDB was created by Damien Katz, starting development in 2004. He had previously worked for IBM on Notes Domino, developing templates and eventually completely rewriting one of the core features, the formula engine, for ND6.
CouchDB shares a basic concept of a document oriented database with views that Notes Domino has.
In this model, "documents" are just arbitrary collections of values that are stored somehow. In CouchDB, the documents are JSON objects of arbitrary complexity. In Notes, the values are simple name-value pairs, where the values can be strings, numbers, dates, or arrays of those.
Views are indexes of the documents in the database, displaying certain values, calculating others, and excluding undesired docs. Once the index is built, it is incrementally updated whenever any document in the database changes (is created, updated, or deleted).
In CouchDB, views are built by running a mapping function on each document in the database. The mapping function calls an emit method with a JSON object for every index entry it wants to create for the given document. This JSON object can be arbitrarily complex. CouchDB can then run a second, reducing function on the mapped index of the view.
In Notes Domino, views are built by running a select function (written in Notes Domino formula language) on each document in the database. The select function simply defines whether the document should be in the view or not. A Notes Domino view design also defines a number of columns for the view; each column has a formula that is run against the selected documents to determine the value for that column.
CouchDB is able to produce much more sophisticated view indexes than Notes Domino can.
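To make the CouchDB side concrete, here is a minimal sketch that stores a map/reduce view in a design document over HTTP. A local CouchDB, a hypothetical `books` database, and Python's `requests` are all assumptions of mine; the map function itself is JavaScript kept as a string, which is how CouchDB design documents hold views.

```python
import requests

DB = "http://localhost:5984/books"  # assumed local CouchDB and hypothetical database
# (Recent CouchDB versions require admin credentials; omitted here for brevity.)

design = {
    "views": {
        "by_author": {
            # The map function is JavaScript, stored as a string in the design doc.
            "map": "function (doc) { if (doc.author) emit(doc.author, 1); }",
            "reduce": "_count",  # built-in reduce: count books per author
        }
    }
}
requests.put(DB + "/_design/stats", json=design)

# Query the view, grouped by key (author):
print(requests.get(DB + "/_design/stats/_view/by_author",
                   params={"group": "true"}).json())
```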
CouchDB also has a replication system.
-
Summary (TL;DR): CouchDB is brand new software whose core follows a design that is conceptually similar to, but far more sophisticated than, the one used in Lotus Notes Domino. Lotus Notes Domino is a mature, fully featured product that can be deployed today. CouchDB is starting from scratch, building a solid foundation for future feature development; Lotus Notes Domino is continuing to develop new features, but is doing so on a 20-year-old platform that strives to maintain backwards compatibility. There are features in Notes Domino that you might wish were in CouchDB, but there are also features in Notes Domino that are anachronistic in today's world.
It is the Notes application and UI that people usually hate, not the architecture behind it.
Damien Katz worked at Iris (Lotus), but he was not the guy behind the Notes database. He is well known in the Lotus Notes community for redesigning the Notes formula engine.
There are definitely some similarities between CouchDB and Lotus Notes, such as their document-oriented, non-relational data, and replication capabilities, but they are more disparate than similar. CouchDB is a database server and Lotus Notes is an enterprise-level collaboration platform.
@Lex, you should perhaps say what version of Notes/Domino you are working with, because your comments are incorrect.
"No transaction support" - Domino has transactional logging. If you want more complex transaction logging that is also available within coding.
"not well suited for handling multiple data transactions" - Actually it handles them just fine. You have document locking and replication conflict resolution. Depends a lot on how you set up your application to handle workflow.
"No separation between production/dev environments." - False. The only way this could be true is if you had a badly deployed environment. Developers normally should have 0 access to deploy design changes to the production environment. They would work off a template which does not replicate to main servers. Once updates are done and approved then the administrator deploys it. They do this by taking the template and signing it with a controlled signature allowed to run on production, then drop the template in and update the design of the related applications.
"The more data lotus notes contains, the more views will likely get created" - This comment makes absolutly no sense what-so-ever. I don't believe you have used Notes/Domino in any professional ability.
"lotus script is not object oriented" - Yes you make good points there. However it doesn't mean that the language is flawed. Also they have made a large number of improvements since 8.x and with 8.5.1. For example built in web services support (point to WSDL and LS code is made for you). 8.5.1 Also has a lot of new designer features like Code Templates, auto-completion, LSDoc popup help on your own functions, etc.
You also only touch on LotusScript. Yet you can also code in:
Java, SSJS/Dojo (XPages), JavaScript, @Formula language, web services (SOAP/REST), the C API, and Eclipse plugins (RCP). Output in JSON as well as XML is supported.
The 8.5.1 Designer client is now free to download if you want to test it out.
So while I am not in a position to comment on CouchDB, you most certainly are not in one to comment on Notes/Domino.
The Lotus Notes client/Domino server combination comprises a (non-relational) object ("document") storage mechanism, a fully integrated certificate-based security model with user management, and conflict resolution for syncing offline/online changes to data - it's a platform for distributed applications.
"CouchDB is a document-oriented, Non-Relational Database Management Server (NRDBMS)."
CouchDB is accessible via a REST style API.
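To illustrate that REST style, here is a minimal sketch using Python's `requests`; the database name and document are made up, and it assumes a local CouchDB without authentication (recent versions require admin credentials).

```python
import requests

BASE = "http://localhost:5984"  # default local CouchDB address (assumption)

requests.put(BASE + "/albums")  # create a database
# Store a document; CouchDB accepts arbitrary JSON structure:
resp = requests.post(BASE + "/albums",
                     json={"artist": "Miles Davis", "title": "Kind of Blue", "year": 1959})
doc_id = resp.json()["id"]
# Read it back with a plain HTTP GET:
print(requests.get(BASE + "/albums/" + doc_id).json())
```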
There's a podcast interview with Jan Lehnardt of the CouchDB team that's worth a listen.
Without going back and listening to it again, I believe that Damien Katz, who was the initiator and is still the lead developer of CouchDB, was also the guy behind the Notes database. So there's a sense in which CouchDB is a better Notes DB, I guess. He explains some of the differences on his blog.
It's similar to how Notes deals with data in that everything is a document of arbitrary structure, and you have views over those documents instead of the tables and records you'd have in a relational database. The replication model also has some similarities.
There isn't anything wrong with the Notes server architecture, people don't hate that so much. It's more the implementation and bloat that comes with Notes.
CouchDB has no front end either, just a server component. The Notes client sucks, and that is what people REALLY hate. Have you ever tried to email, uh, I mean "memo" something from Notes? Not pleasant :(
Comparing Apples & Oranges
Lotus Notes Domino hasn't changed much, and there is no NoSQL service option, on-prem or cloud, for Notes Domino v12 or any earlier version. Domino is not cloud-based tech.
When it comes to NoSQL, Domino uses NoSQL only for its own application solutions built in Domino. There was an attempt with Domino Access Services, which is based on Java 6; its REST API still uses Vectors in v12. The service is OK, not robust; it provides a way to interface with the data in an NSF. Remember, Domino is key-value-pair storage and very slow on large data sets because of the security model: each document is checked for reader and author access with every search to determine whether the user can view it. Domino is still Web 1.0.
With CouchDB, one can build an app on mobile and deploy it. There is no way to do the same with Notes/Domino because of the Domino server. Domino development also only supports MS Windows, and the IDE is based on older versions of Eclipse; to this day, in v12, there is no way to use dual monitors with the Domino IDE. Ask any Domino developer: they hate being forced to use an IDE on a specific platform that cannot keep up with the industry.
Couch has gone through many changes as well; a brief history:
CouchDB started by Damien Katz, IBM Lotus Domino engineer
Apache project BigCouch is born; scalability and clustering added
Cloudant is born; big data, IBM funding, and an IBM Cloud offering
CouchDB 2.0 is born; Cloudant's BigCouch merged back into CouchDB
CouchDB 3.0 is born; enhanced security and prep for FoundationDB
CouchDB 4.0 is born; architecture changed to Apple's FoundationDB
https://www.dataengineeringpodcast.com/couchdb-document-database-episode-124/
