How can I find a text-to-emoji dataset for a neural network? - dataset

I'm building a neural network that takes a text prompt and returns an emoji. I've found similar projects on the web, but they only return a handful of emojis; I want to build something specific enough that it can return all 4581 different ones.
I've found this dataset containing every single emoji, but I'm struggling to find a dataset suitable for training the model. This is my third neural network, so I'm still pretty new to this. It would be great to get a big dataset I could train on, or some way to gather the data from the web.
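One possible starting point - not a full training set, but a way to get one (description, emoji) pair for every emoji - is to dump the CLDR short names shipped with the Python emoji package; a rough sketch, assuming emoji 2.x and its EMOJI_DATA dictionary:

```python
# Rough sketch: build (text, emoji) training pairs from CLDR short names.
# Assumes the third-party "emoji" package, version 2.x, whose EMOJI_DATA
# dict maps each emoji character to metadata (short name under the "en" key).
import csv
import emoji

rows = []
for char, data in emoji.EMOJI_DATA.items():
    # "en" looks like ":face_with_tears_of_joy:"; turn it into plain text
    description = data["en"].strip(":").replace("_", " ")
    rows.append((description, char))

with open("emoji_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "emoji"])
    writer.writerows(rows)

print(f"wrote {len(rows)} (text, emoji) pairs")
```

These one-phrase descriptions are thin on their own, so you would probably still want to augment them with real sentences that contain each emoji (e.g. scraped social media posts) so the model sees varied wording.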

Related

Web data extraction and data mining; Scraping vs Injection and how to get data.. like yesterday

I feel like I should almost give a synopsis of these lengthy questions.
I apologize if all of these questions have been answered in a previous question/answer post, but I have been unable to locate any that specifically addresses all of the following queries.
This question involves data extraction from the web (i.e. web scraping, data mining, etc.). I have spent almost a year researching these fields and how they can be applied to a certain industry. I have also familiarized myself with PHP and MySQL/phpMyAdmin.
In a nutshell, I am looking for a way to extract information from a site (probably several gigabytes' worth) as quickly and efficiently as possible. I have tried web scraping programs like Scrapy and WebHarvey, and I have also experimented with programs like HTTrack. All have their strengths and weaknesses. I have found that WebHarvey works pretty well, but it has limitations when scraping images stored in gallery widgets, and many of the sites I am extracting from use other tricks that make mining the data a pain. It would take months to extract the data using WebHarvey, which I can't complain about too much, given that I'd be extracting millions of rows of data exported in CSV format into Excel. But again, images and certain AJAX widgets throw the program off when it tries to extract image files.
So my questions are as follows:
Are there any quicker ways to extract said data?
Is there any way to get around WebHarvey's image limitations (i.e. only being able to extract one image within a gallery widget, and not being able to follow sub-page links on sites that embed things in unusual ways)?
Are there any ways to bypass site search form parameters that limit the number of search results (i.e. obtaining all business listings within an entire state instead of being limited to one county per search, as the form restricts)?
Also, this is public information, so it cannot be copyrighted; anybody can take it :) (case in point: Feist Publications v. Rural Telephone Service). Extracting information is extracting information; it's legal to extract as long as we are talking about facts/public information.
So with that said, wouldn't the most efficient method (grey area here) of extracting this "public" information (assuming vulnerabilities existed) be through the use of SQL injection... if one were so inclined? :)
As a side question, just how effective is Tor at obscuring one's IP address?
Any help, feedback, suggestions or criticism would be greatly appreciated. I am by no means an expert in any of the above mentioned fields. I am just a motivated individual with a growing interest in programming and automation who has a lot of crazy ideas. Thank you.
You may be better off writing your own Linux command-line scraping program using either a headless browser library like PhantomJS (JavaScript) or a test framework like Selenium WebDriver (Java).
Once your scraping program is complete, you can scale it up by installing it on a cloud server (e.g. Amazon EC2, Linode, Google Compute Engine or Microsoft Azure) and duplicating the server image to as many servers as required.
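If you go the Selenium route, a minimal Python sketch (the URL and CSS selectors are placeholders, and it assumes the selenium package with a local Chrome/chromedriver) might look like this:

```python
# Scraping sketch: headless Chrome via Selenium (Python). The URL and CSS
# selectors below are hypothetical placeholders; assumes the "selenium"
# package (4.x) and a locally available Chrome/chromedriver.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run without a visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/listings?page=1")
    # Once the page has rendered, gallery images are plain <img> tags,
    # so every image can be collected, not just the first one.
    for img in driver.find_elements(By.CSS_SELECTOR, ".gallery img"):
        print(img.get_attribute("src"))
    # Sub-page links can be queued and crawled in a follow-up pass.
    for link in driver.find_elements(By.CSS_SELECTOR, "a.detail-link"):
        print(link.get_attribute("href"))
finally:
    driver.quit()
```

Because the headless browser renders the page, gallery widgets and AJAX-loaded images end up as ordinary <img> elements, and each cloud instance can run the same script against a different slice of URLs.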

Save Information as a Data Net

My aim is to write an intelligent chatbot. It should store known information in a way similar to the human brain.
That is why I am looking for a file format which stores data as a net of connected keywords. What file format or database system could achieve this?
Further information:
The information input will be Wikipedia, Google searches, and facts taught by a human during a conversation.
I could give specific details about my requirements and wishes, but I don't know whether any approach to this even exists. Maybe there are more useful specifications than my own thoughts.
Just one example: the connections should have weights, and querying the information net should increase the weights of the connections that were used (see the sketch below).
What I expect is that the chatbot could form real associations (or ideas) using the data net.
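A minimal in-memory sketch of that weighting idea (all names here are hypothetical; persistence is discussed in the answer below): each query looks up a keyword's neighbours and reinforces the connections it used.

```python
# Minimal in-memory sketch of a weighted keyword net (hypothetical names).
# Edges live in weights[a][b]; querying a keyword returns its strongest
# neighbours and reinforces the connections that were used.
from collections import defaultdict

weights = defaultdict(lambda: defaultdict(float))

def connect(a, b, w=1.0):
    weights[a][b] += w
    weights[b][a] += w

def associate(keyword, top_n=3):
    neighbours = sorted(weights[keyword].items(), key=lambda kv: -kv[1])[:top_n]
    for other, _ in neighbours:  # reinforcement: used connections get heavier
        weights[keyword][other] += 0.1
        weights[other][keyword] += 0.1
    return [other for other, _ in neighbours]

connect("dog", "animal")
connect("dog", "bark")
connect("animal", "cat")
print(associate("dog"))  # e.g. ['animal', 'bark']
```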
As an extension to my above comments:
A graph is definitely the way you want to go in terms of data representation: it maps perfectly to your problem description.
What you seem to be asking is how you can persistently store this information on disk (rather than in memory). That completely depends on what constraints you have. There are "graph databases", which are geared more towards storing graphs than, say, relational or hierarchical databases, and which would perform far better than pushing your adjacency matrix or list to a flat file. Here's the Wikipedia entry:
http://en.wikipedia.org/wiki/Graph_database
Now, there is the issue of what happens when you have so many nodes and edges that you can't load them all into memory at once, and unfortunately if you have nodes that are connected to every other node, that can be a problem (because you won't be able to load the complete graph). I can't answer that right now, but I'm sure there are paradigms to address it. I will update my answer after some digging.
Edit: You'll probably have to consult someone who knows more about graph databases. It's possible that there are ways to load chunks of the graph from the database without loading the whole thing. If that's your issue, you may want to reframe this as a question about working with large graphs stored in graph databases and post it again in a more specific form, tagged with graphs, databases, algorithms, and the like.
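As a concrete sketch of the graph-database route, assuming a Neo4j instance and its official Python driver (the connection details, label, and relationship name below are invented), the weighted connections from the question could be stored and reinforced like this:

```python
# Sketch only: assumes a running Neo4j instance and the "neo4j" Python
# driver (5.x); the URI, credentials, label and relationship name are
# placeholders invented for this example.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def reinforce(tx, a, b):
    # MERGE creates the keywords/connection if missing, otherwise bumps the weight.
    tx.run(
        """
        MERGE (x:Keyword {name: $a})
        MERGE (y:Keyword {name: $b})
        MERGE (x)-[r:RELATES_TO]->(y)
        ON CREATE SET r.weight = 1
        ON MATCH  SET r.weight = r.weight + 1
        """,
        a=a, b=b,
    )

with driver.session() as session:
    session.execute_write(reinforce, "dog", "animal")  # write_transaction in 4.x
driver.close()
```

MERGE keeps one node per keyword and one relationship per pair, so repeated use simply increments the weight instead of duplicating edges.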

Web client classification using an Artificial Neural Network

I have a high traffic web site.
I want to create software which analyses client requests on the fly and decides whether they come from a real user or a botnet bot. To train the neural network to identify legitimate ("good") users, I can use logs from periods with no DDoS activity. Once trained, the network would distinguish real users from bots.
What I have:
request URI (and order)
cookie
user agent
request frequency.
Any ideas on how to best design ANN for this task and how to tune it?
Edit: [in response to comments about the overly broad scope of this question]
I currently have a working C# program which blocks clients based on the frequency of identical requests. Now I'd like to improve its "intelligence" with a classifier based on a neural network.
I don't know how to normalize these inputs for an ANN, and I need suggestions in this specific area.
This isn't really suited to neural networks. Neural networks are great provided that (as a rough guide):
You can spare the processing power,
The data is not temporal,
The input data is finite.
I don't think your problem really passes any of these tests.
Re: normalizing the inputs: you either map your input data to a set of symbols (which are then turned into numbers), or you map each input to a floating-point number that represents some degree of intensity. You can map any kind of data to any kind of scheme, but you would really only want to use ANNs when the problem is nonlinear, i.e. when the data for one class can't be separated from the data for the other class by a simple line. In either case you end up with a vector of inputs associated with an output ([BOT, HUMAN], [BOT, HUMAN, UNKNOWN], [BOT, PROBABLY-BOT, PROBABLY-HUMAN, HUMAN], etc.).
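As a hedged illustration of that mapping (the field names, thresholds, and URI bucketing below are invented, not a recommended feature set), the four signals listed in the question could be turned into a fixed-length, roughly [0, 1]-scaled input vector like this:

```python
# Sketch: turn one request-log entry into a normalized feature vector.
# All field names, thresholds and the URI bucketing are invented here.
import hashlib

def features(entry):
    return [
        min(entry["requests_per_minute"] / 100.0, 1.0),    # frequency, capped at 100/min
        1.0 if entry["has_cookie"] else 0.0,                # did the client send our cookie?
        1.0 if "Mozilla" in entry["user_agent"] else 0.0,   # crude browser-vs-script flag
        (int(hashlib.md5(entry["uri"].encode()).hexdigest(), 16) % 1000) / 1000.0,
        # URI hashed into [0, 1); a real system would one-hot encode known paths
    ]

sample = {
    "requests_per_minute": 240,
    "has_cookie": False,
    "user_agent": "python-requests/2.31",
    "uri": "/search?q=a",
}
print(features(sample))
```

Each such vector would then be paired with a label (e.g. BOT = 1, HUMAN = 0) taken from the clean logs for training.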
How do you distinguish between two users coincidentally submitting the exact same book request sequentially in time (let's assume you are selling books)?

Environmental database design

I've never designed a database before, but I've had experience programming in a few languages (including assembler) throughout college, as well as some web design, so I can at least pick up what I need to know if I'm pointed in the right direction.
One of the tasks of my job is to sort through data that we've been collecting in the field using a "sonde", which measures temperature, pH, conductivity, and other parameters. The device sits in a stream 24/7, except when we swap it with our other sonde every couple of weeks so that a freshly calibrated one goes into the stream and we can retrieve the data from the one that was in the field. It collects data every 15 minutes or so, and has done so since 2007. Currently, all of our data is spread across multiple Excel spreadsheets, and we have additional data from a weather station and another instrument that all gets compiled into quarterly documents.
My goal is to design as simple a database as possible with most of the functionality of a database like this: http://hudson.dl.stevens-tech.edu/hrecos/d/index.shtml. Ours would be significantly simpler because it would not hold live data; instead it would load data from files we upload once we've finished formatting and compiling everything. I would very much like the graphing ability that the site above has, but at a minimum I need to be able to select a time range and as many variables as I want within it, and then download a spreadsheet (or at least a CSV file) of the selected data.
I realize this is a tough task, and since I have not designed a database before, I suspect it is very much an uphill one. However, if I can learn what's necessary to do this and make it web-accessible, that would be a huge accomplishment and would very much impress my boss. Any advice or tips to point me in the right direction would be very much appreciated.
Thanks for your help!
There are actually 2 parts to the solution you're looking for:
The database, which will store your data in a single organized place, and
The application, which is the interface used by people to interact with the database.
Basically, a database by itself is just a container. You need some kind of application which accepts criteria from a user, pulls the appropriate data from the database, and displays it to the user in a meaningful fashion - in this case, a graph or a spreadsheet.
Normally, for web-based apps, the database and application are two separate components. However, for a small app with a fairly small number of users, and especially for someone just starting out, you may want to consider an all-in-one solution like InfoDome - sort of like MS Access for the web.
Either way, you're still going to need to learn about database design. There are many good tutorials out there; just do some searching. DatabaseAnswers.org has been useful for me - they have a set of tutorials as well as a large collection of sample database schemas.
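To make that concrete, here is a minimal sketch of a long/narrow schema and a range query for the sonde readings, using SQLite for brevity; every table and column name is invented:

```python
# Sketch of a "long/narrow" schema plus a range query, using SQLite for
# brevity. Every table and column name here is invented for illustration.
import csv
import sqlite3

conn = sqlite3.connect("sonde.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS reading (
        station     TEXT NOT NULL,   -- which sonde / site
        measured_at TEXT NOT NULL,   -- ISO timestamp, one row per 15-minute reading
        parameter   TEXT NOT NULL,   -- 'temperature', 'pH', 'conductivity', ...
        value       REAL,
        UNIQUE (station, measured_at, parameter)
    );
""")

# "Select a range of data and as many variables as I want", then export it:
rows = conn.execute(
    """
    SELECT measured_at, parameter, value
    FROM reading
    WHERE station = ?
      AND measured_at BETWEEN ? AND ?
      AND parameter IN ('temperature', 'pH')
    ORDER BY measured_at
    """,
    ("stream-1", "2012-01-01", "2012-03-31"),
).fetchall()

with open("export.csv", "w", newline="") as f:
    csv.writer(f).writerows([("measured_at", "parameter", "value"), *rows])
```

The long/narrow layout (one row per station, timestamp, and parameter) means adding a new measured parameter never requires changing the schema, and the same query shape handles "any time range, any subset of variables".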

Managing a large collection of music

I'd like to write my own music streaming web application for my personal use, but I'm racking my brain on how to manage it. Existing music files and their locations rarely change, but they still can (fixing a filename or ID3 tags, moving /Chemical Brothers to /The Chemical Brothers). How would the community manage all of these files? I can gather a lot of information through just an ID3 reader and my file system, but it would also be nice to keep track of things like how often each track is played. Would using iTunes's .xml file be a good choice - just keeping my music current in iTunes and basing my web application's data on it? I was thinking of keeping track of all my music by MD5'ing each file and using that as the unique identifier, but if I change the ID3 tags, will that change the MD5 value?
I suppose my real question is: how can you keep track of a large amount of music? Keep the metadata in a database? My real question is how I would connect each file to its database entry - or should I just read from the filesystem whenever I need the information?
I missed part 2 of your question (the MD5 thing). I don't think an MD5/SHA/... solution will work well, because such hashes don't let you find duplicates in your collection (like popular tracks that appear on many different compilations). And especially with big collections, that's something you will want to do someday.
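One note on the MD5 part: hashing the whole file does change whenever the ID3 tags are edited, because the tag bytes are part of the file. Hashing only the audio frames - skipping the ID3v2 block at the start and the 128-byte ID3v1 block at the end - stays stable across tag edits; a rough sketch, assuming reasonably well-formed MP3 files:

```python
# Sketch: hash only the MP3 audio frames so the identifier survives tag
# edits. Assumes reasonably well-formed files; no extra library needed.
import hashlib

def audio_only_hash(path):
    with open(path, "rb") as f:
        data = f.read()

    start = 0
    if data[:3] == b"ID3":
        # ID3v2: 10-byte header + 28-bit synchsafe size (+10 if a footer is present)
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        start = 10 + size + (10 if data[5] & 0x10 else 0)

    end = len(data)
    if data[-128:-125] == b"TAG":  # trailing ID3v1 tag
        end -= 128

    return hashlib.md5(data[start:end]).hexdigest()
```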
There's a technique called acoustic fingerprinting that shows a lot of promise; have a look here for a quick intro. Even if there are minor differences in recording levels (like those popular "normalized" tracks), the acoustic fingerprint should remain the same - I say should, because none of the techniques I tested is really 100% error-free. Another advantage of acoustic fingerprints is that they can help you with tagging: a service like FreeDB only works on complete CDs, while acoustic fingerprints can identify single tracks.
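For the fingerprinting route, one option is Chromaprint (the library behind AcoustID), which has Python bindings; a minimal sketch, assuming the pyacoustid package and the Chromaprint/fpcalc native tools are installed - treat the exact API as an assumption:

```python
# Sketch: acoustic fingerprint via Chromaprint. Assumes the "pyacoustid"
# package and the Chromaprint/fpcalc native library are installed.
import acoustid

duration, fingerprint = acoustid.fingerprint_file("track.mp3")
print(duration, fingerprint[:40], "...")
# Two files with (near-)identical fingerprints are very likely the same
# recording, even if their filenames, ID3 tags or encoders differ.
```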
For inspiration, and maybe even for a complete solution, check out Ampache. I don't know what you call large, but Ampache (a PHP application backed by a MySQL database) easily handles music collections of tens of thousands of tracks.
Recently I discovered SubSonic, and its web site says "Manage 100,000+ files in your music collection without hassle", but I haven't been able to test it yet. It's written in Java and the source looks pretty neat at first sight, so maybe there's inspiration to be had there too.

Resources