I'm interested in finding songs based on attributes (minor key tonality, etc.). These are the sorts of attributes Pandora lists when explaining why it picks songs, but to use Pandora I have to give it songs or artists.
Is there any way to get the Music Genome database (or something similar) so I can search for songs based on attributes that someone else has already catalogued?
You can use Gracenote's Global Media Database and search with Track-level attributes.
"Gracenote's Media Technology Lab scientists and engineers take things further by utilizing technologies like Machine-Listening and Digital Signal Processing to create deep and detailed track level descriptors such as Mood and Tempo."
I don't think there is any way to access this proprietary data; I asked them about it long ago. It seems to me they want to protect this unique part of their system; after all, they've paid for the man-hours to label each song. Even if Pandora releases a developer API, which they've hinted at, I doubt it will provide access to the Music Genome information.
Give Echo Nest a shot!
To add to the answers above, Pandora's statement (as viewed using the above link in combination with the Internet Archive) was:
"A number of folks also asked about the prospect for an open API, to allow individual developers to start building on the platform. We're not there yet, but it's certainly food for thought."
Given that this was seven years ago, I think their decision is pretty clear.
As a personal project, I'm trying to build a simulated AI that uses what it has learned plus internet searches to give more detail than the system knows out of the box.
I took the example of a child: when he's born he needs to learn everything, he hears a lot, and then he proposes some answers; his mom/dad tell him whether those answers are suitable or not.
To do that, I want to store a lot of chat conversations in a Hadoop system and parse all of those conversations to determine the most frequent answer given to each kind of message. From that I want to build a neural database that contains conversation types together with the determined answers.
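As a rough illustration of the "most frequent answer" idea (not tied to Hadoop; just a single-machine sketch with an invented file layout), you could count how often each reply follows each message:

    import csv
    from collections import Counter, defaultdict

    # Hypothetical input: a CSV of conversation turns with columns "message" and "reply".
    def most_frequent_replies(path):
        counts = defaultdict(Counter)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                counts[row["message"].strip().lower()][row["reply"].strip()] += 1
        # Keep only the single most common reply for each message.
        return {msg: replies.most_common(1)[0][0] for msg, replies in counts.items()}

    answers = most_frequent_replies("conversations.csv")
    print(answers.get("how are you?", "I don't know yet"))

The same grouping and counting maps naturally onto a MapReduce job once the corpus is too big for one machine.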
So my question is: can I legally find one or more chat/conversation datasets somewhere on the internet, in any format (file, database, CSV, ...)?
The more data I have, the better my chances of determining the answers correctly ;)
Thanks for the help and cheers,
Frédéric
PS: English is not my mother tongue
There is a collection of conversational datasets. Most of them are collected from publicly available sources. For you, the most interesting ones could be the Santa Barbara corpus (although it's a transcript of spoken conversations) or the movie dialog dataset.
Here is a fairly comprehensive collection of human-human and human-machine text dialogue datasets, as well as audio dialogue datasets.
https://breakend.github.io/DialogDatasets/
Credit goes to "Default picture"'s answer above for the extensive library of human-human and human-machine conversation resources at https://breakend.github.io/DialogDatasets/, including the Let's Go dialogs provided by the Dialog Research Center at CMU (https://github.com/DialRC/LetsGoDataset). Those resources are also used to train conversational agents at https://any.company/
The best way to get a chat dataset is to generate your own; you know exactly what you want. That said, there are some IRC chat datasets, one of which has been used in this research.
I'm having some doubts about which approach I should use for a new piece of software.
No code has been written yet; I'm breaking down all the requirements first and only then will I start coding.
This will be implemented in a computer company that provides services for other companies, onsite and remotely.
These are my variables:
Number of technicians
Location of customer
Type of problem
Services already scheduled for the technician
Expertise of the technician about the situation
Customer priority
Maybe some are missing, but these are the most important ones.
This job is currently done manually, and as humans, we sometimes fail to see the best route to take.
Let's say that a customer calls with a printer problem.
First, check which tech knows about printers.
Then: is the tech available? Is he far from the customer? Can it be done remotely (a software issue)?
Can it be done by another tech who is closer to the customer's location?
Does this customer have higher priority than the one the same tech is already scheduled to visit?
Is the technician's schedule full? If so, pass the job to another printer/hardware tech.
I know my English is not perfect (it's not my native language), but I'll try to provide more details or correct the text as needed.
So, my question is this: what kind of approach would you take? A genetic algorithm seems nice for this kind of job, and I also have some experience with GAF and WatchMaker (Java GA Framework). However, reading the text above, an expert system also seems appropriate.
Has anyone done something like this? I searched for this kind of software and couldn't find anything similar.
Would another approach be better than the two mentioned?
Also, I'm building a table of each tech's capabilities and expertise, with simple ratings from 1 to 5 for each skill. This is also a decision factor.
Thanks.
Why not do both? Use an expert system (a rule engine) to define your constraints and use a metaheuristic (such as Local Search or Genetic Algorithms) to solve it. The planning engine OptaPlanner (Java, open source) does exactly that (by using the rule engine Drools): the rule engine scores each candidate solution against your constraints, and the metaheuristic searches for solutions with better scores.
Here's a video demonstrating the constraint flexibility on the vehicle routing problem (VRP). Your problem seems to be an advanced variant of VRP (which is itself a variant of TSP).
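To make the rules-plus-search split concrete, here's a rough hand-rolled Python sketch (not OptaPlanner code; the data fields and weights are invented for illustration): hard constraints as small rule functions, and a score used to pick the best feasible tech for one job.

    # Invented data shapes for illustration only.
    techs = [
        {"name": "Ana", "skills": {"printer": 5, "network": 2}, "jobs_today": 3, "distance_km": 4},
        {"name": "Rui", "skills": {"printer": 3, "network": 5}, "jobs_today": 1, "distance_km": 20},
    ]
    job = {"type": "printer", "priority": 2, "remote_possible": False}

    # "Expert system" part: hard constraints as simple rules.
    def feasible(tech, job):
        return tech["skills"].get(job["type"], 0) >= 3 and tech["jobs_today"] < 6

    # Soft score: higher is better; the weights are arbitrary and would need tuning.
    def score(tech, job):
        return (3 * tech["skills"][job["type"]]
                - 0.5 * tech["distance_km"]
                - tech["jobs_today"]
                + 2 * job["priority"])

    candidates = [t for t in techs if feasible(t, job)]
    best = max(candidates, key=lambda t: score(t, job)) if candidates else None
    print(best["name"] if best else "escalate / queue the job")

A real metaheuristic would score whole schedules (all jobs and techs at once) and iteratively swap assignments instead of picking greedily per job, but the split between rules and search stays the same.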
Maybe you can start off with TSP: http://en.m.wikipedia.org/wiki/Travelling_salesman_problem. Note that it only deals with distance, though.
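To get a feel for that routing core, a nearest-neighbour construction heuristic for TSP is only a few lines (a rough sketch; it considers nothing but straight-line distance):

    import math

    def nearest_neighbour_route(points, start=0):
        """points: list of (x, y) customer locations; returns a visit order as indices."""
        unvisited = set(range(len(points))) - {start}
        route, current = [start], start
        while unvisited:
            # Always hop to the closest customer not visited yet.
            current = min(unvisited,
                          key=lambda i: math.dist(points[current], points[i]))
            route.append(current)
            unvisited.remove(current)
        return route

    print(nearest_neighbour_route([(0, 0), (5, 1), (1, 4), (6, 6)]))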
Is there any algorithm with which I can automatically create a playlist of songs that go well with each other -- similarly to services like iTunes Genius -- that a single developer can actually implement? It should either a) not require any sort of remote database of listening habits etc., or b) require such a database, but work with one that is freely available.
I did this, and I used the last.fm database as described by tomasz. I didn't use "related artist" directly, but instead constructed my own relationship graph by comparing tags associated with different artists (this is not the approach suggested by lcfseth, btw - I have quite a large range of music and I wanted to explore "natural" connections that might not be common partners in "normal" playlists; also I wasn't sure how uniform the related artists were).
I also used a local database to cache data from last.fm, because calls to the API are rate limited, and I experimented with using other parts of the API to improve/normalize the information I was reading from MP3 tags.
Generating a useful graph of related artists was actually quite hard, largely because some nodes in the graph naturally tend to be more important than others. If you don't "even out" the graph then your playlist will keep returning to the "important" artists.
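Not the actual code (that's linked below), but the general idea of the tag graph plus the "evening out" can be sketched in Python like this: Jaccard similarity between artists' tag sets, with each node's outgoing edge weights normalised so heavily connected artists don't dominate a weighted random walk. The input format is an assumption (a dict of tag sets cached from wherever you pull your data).

    import random

    # Assumed input: {artist: set of tags}, e.g. cached from last.fm tag queries.
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def build_graph(artist_tags, threshold=0.2):
        artists = list(artist_tags)
        graph = {a: {} for a in artists}
        for i, a in enumerate(artists):
            for b in artists[i + 1:]:
                w = jaccard(artist_tags[a], artist_tags[b])
                if w >= threshold:
                    graph[a][b] = graph[b][a] = w
        # "Even out" the graph: divide each edge by the node's total weight so
        # heavily connected artists don't dominate the walk.
        return {a: {b: w / sum(nbrs.values()) for b, w in nbrs.items()}
                for a, nbrs in graph.items() if nbrs}

    def next_artist(graph, current):
        nbrs = graph.get(current)
        if not nbrs:
            return random.choice(list(graph))
        artists, weights = zip(*nbrs.items())
        return random.choices(artists, weights=weights)[0]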
The final result did work well, in that the selection of music had a good balance between "central theme" and variation. But the implementation is not at all polished, the calculation of the graph can take a long time (many hours), the program takes up a fair amount of memory when running, and it still seems to play Elvis Costello a little more than expected ;o)
If you are interested, the code is at http://code.google.com/p/uykfe/
The best part of all, from my point of view as a user, is that it can update Logitech Media Server (Squeezeserver) playlists in "real time", adding a new track whenever the list is empty. That works really well in continuing from whatever music you select "by hand". It can also generate one-off playlists, of course, and, finally, by tweaking parameters you can get a kind of "random walk" through your music collection - it will play related tunes but slowly drift from one style to another (in fact, this is really the "default" mode - to get it to stay on a single theme I needed extra logic that biased it towards whatever music it had played earlier).
PS: Also, the dump of the final graph to Gephi was really cool - I had it printed out and it's now pinned to the wall...
PPS: I also experimented with the MusicBrainz database, which in theory sounds like a fantastic resource, but in practice it is over-complex and poorly documented.
I don't know iTunes Genius, but I think the last.fm database and API might be useful for you. Every time you look at any track, it shows you a list of similar tracks, based on other users' preferences. The same information can be obtained using the track.getSimilar API method.
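A quick sketch of calling that method over last.fm's REST API in Python (you need your own API key, and it's worth checking the current docs in case the response fields differ from what's assumed here):

    import requests  # third-party: pip install requests

    API_KEY = "your_lastfm_api_key"  # placeholder

    def similar_tracks(artist, track, limit=10):
        resp = requests.get("https://ws.audioscrobbler.com/2.0/", params={
            "method": "track.getsimilar",
            "artist": artist,
            "track": track,
            "api_key": API_KEY,
            "format": "json",
            "limit": limit,
        }, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        return [(t["artist"]["name"], t["name"])
                for t in data["similartracks"]["track"]]

    for artist, name in similar_tracks("Daft Punk", "One More Time"):
        print(artist, "-", name)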
The idea behind most of these databases is to see what other users listen to after they listen to a given song. The accuracy of these statistics depends on the number of users, so it is probably hard to build something like this locally. The algorithm itself is not that hard to implement.
The alternative would be to sort songs based on genre, artist, etc., which is information that is usually, but not always, embedded in the songs. Winamp has this feature, but it won't work for old songs unless you set the information manually or use an online song database.
I'd like to write my own music streaming web application for my personal use, but I'm racking my brain over how to manage it. Existing tracks and their locations rarely change, but they still can (fixing a filename or ID3 tags, /The Chemical Brothers instead of /Chemical Brothers). How would the community manage all of these files? I can gather a lot of information through just an ID3 reader and my file system, but it would also be nice to keep track of how often each track is played and such. Would using iTunes's .xml file be a good choice - just keeping my music current in iTunes and basing my web application's data off of it? I was thinking of keeping track of all my music by MD5'ing each file and using that as the unique identifier, but if I change the ID3 tags, will that change the MD5 value?
I suppose my real question is: how can you keep track of large amounts of music? Keep the metadata in a database? How to connect each file to its database entry is my real question - that, or just use a read-when-needed filesystem setup.
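On the MD5 point: yes, editing ID3 tags rewrites bytes in the file, so an MD5 of the whole file changes too. A rough workaround is to hash only the audio payload. A sketch (it assumes a standard ID3v2 tag at the start and ID3v1 in the last 128 bytes, and ignores rarer variants such as ID3v2 footers):

    import hashlib

    def audio_md5(path):
        with open(path, "rb") as f:
            data = f.read()
        start, end = 0, len(data)
        # Skip an ID3v2 tag at the start: "ID3" header, then a 4-byte syncsafe size.
        if data[:3] == b"ID3" and len(data) > 10:
            size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
            start = 10 + size
        # Skip an ID3v1 tag in the last 128 bytes.
        if len(data) >= 128 and data[-128:-125] == b"TAG":
            end = len(data) - 128
        return hashlib.md5(data[start:end]).hexdigest()

    print(audio_md5("song.mp3"))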
I missed part 2 of your question (the MD5 thing). I don't think an MD5/SHA/... solution will work well, because it doesn't allow you to find duplicates in your collection (like popular tracks that appear on many different samplers). And especially with big collections, that's something you will want to do someday.
There's a technique called acoustic fingerprinting that shows a lot of promise; have a look here for a quick intro. Even if there are minor differences in recording levels (like those popular "normalized" tracks), the acoustic fingerprint should remain the same - I say should, because none of the techniques I tested is really 100% error-free. Another advantage of these acoustic fingerprints is that they can help you with tagging: a service like FreeDB will only work on complete CDs, while acoustic fingerprints can identify single tracks.
For inspiration, and maybe even for a complete solution, check out Ampache. I don't know what you call large, but Ampache (a PHP application backed by a MySQL database) easily handles music collections of tens of thousands of tracks.
Recently I discovered SubSonic; its web site says "Manage 100,000+ files in your music collection without hassle", but I haven't been able to test it yet. It's written in Java and the source looks pretty neat at first sight, so maybe there's inspiration to be found there too.
Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!
But I'm having a hard time with how useful this could be.
Most application data is pretty specific to that application even on the web. For example, let's say I scrape all of the questions and answers off of StackOverflow or all of the results off of Google (assuming this were possible) - I'm left with data that is not very useful unless I either have a competing question and answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).
So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.
It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.
The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.
It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
A good example is StackOverflow - no need to scrape data as they've released it under a CC license. Already the community is crunching statistics and creating interesting graphs.
There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.
Don't steal and republish, build something even better with the data - new ways of understanding, searching or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language or understand data or help promote the semantic web. Remember it's for fun not profit!
Hope that helps :)
If the site has data that would benefit from being accessible through an API (and it would be free and legal to do so), but they just haven't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.
Well, to collect data from a mainframe. That's one reason why some people use screen scraping. Mainframes are still in use in the financial world and often it's running software that has been written in the previous century. The people who wrote it might already be retired and since this software is very critical for these organizations, they really hate it when some new code needs to be added. So, screenscraping offers an easy interface to communicate with the mainframe to collect information from the mainframe and then send it onwards to any process that needs this information.
Rewrite the mainframe application, you say? Well, software on mainframes can be very old. I've seen software on mainframes that was over 30 years old, written in COBOL. Often, those applications work just fine and companies don't want to risk rewriting parts because it might break some code that had been working for over 30 years! Don't fix things if they're not broken, please. Of course, additional code could be written but it takes a long time for mainframe code to be used in a production environment. And experienced mainframe developers are hard to find.
I myself had to use screen scraping in a software project too. This was a scheduling application which had to capture the console output of every child process it started. It's the simplest form of screen scraping, actually, and many people don't even realize that redirecting the output of one application to the input of another is still a kind of screen scraping. :)
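In Python terms, that simplest form is just capturing a child process's console output and picking values out of the text (the command here is only a placeholder):

    import subprocess

    # Run a child process and capture everything it prints to its console.
    result = subprocess.run(["df", "-h"], capture_output=True, text=True)
    for line in result.stdout.splitlines():
        # "Scrape" whatever you need out of the plain text output.
        if "%" in line:
            fields = line.split()
            print(fields[-1], fields[-2])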
Basically, screen scraping allows you to connect one (web) application with another one. It's often a quick solution, used when other solutions would cost too much time. Everyone hates it, but the amount of time it saves still makes it very efficient.
Let's say you wanted to get scores from a popular sports site that did not make the information available through an XML feed or API.
For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.
With hundreds of files a day, the only way to do this was to use WWW::Mechanize in Perl, screen scrape our way through the login and upload forms, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site even slightly it could break the app), but it works. It's been working now for over a year.
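The same flow in Python looks roughly like this (this is not the Perl code from that project; the URLs and form field names are invented, and you'd read the real ones out of the vendor's HTML):

    import requests  # pip install requests

    session = requests.Session()

    # Log in by posting the same fields the login form posts (names invented here).
    session.post("https://vendor.example.com/login",
                 data={"username": "me", "password": "secret"})

    # Upload the file through the web form and save whatever comes back.
    with open("input.dat", "rb") as f:
        resp = session.post("https://vendor.example.com/upload", files={"file": f})
    resp.raise_for_status()

    with open("translated.dat", "wb") as out:
        out.write(resp.content)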
One example from my experience.
I needed a list of major cities throughout the world with their latitude and longitude for an iPhone app I was building. The app would use that data along with the geolocation feature on the iPhone to show which major city each user of the app was closest to (so as not to show exact location), and plot them on a 3D globe of the earth.
I couldn't find an appropriate list in XML/Excel/CSV type format anywhere easily, but I did find this wikipedia page with (roughly) the info I needed. So I wrote up a quick script to scrape that page and load the data into a database.
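Something along these lines (a sketch; the page title, table class, and column order are assumptions to check against the real page):

    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # Hypothetical page; the real article name and columns need checking.
    url = "https://en.wikipedia.org/wiki/List_of_cities_by_latitude"
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    cities = []
    for row in soup.select("table.wikitable tr")[1:]:
        cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
        if len(cells) >= 3:
            cities.append({"city": cells[0], "lat": cells[1], "lon": cells[2]})

    print(cities[:5])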
Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.
For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. Since Stack Overflow had no API at the time, the only way to do that was to screen scrape.
The obvious case is when a webservice doesn't offer reverse search. You can implement that reverse search over the same data set, but it requires scraping the entire dataset.
This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
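A toy version of that pre-processing, assuming the scraped records already sit in a list of dicts: build an inverted index of word prefixes so partial matches don't require rescanning the whole dataset.

    from collections import defaultdict

    records = [
        {"id": 1, "name": "The Chemical Brothers"},
        {"id": 2, "name": "Chemical X"},
    ]

    # Inverted index of lowercase word prefixes -> record ids, so partial
    # matches like "chem" can be answered without scanning every record.
    index = defaultdict(set)
    for rec in records:
        for word in rec["name"].lower().split():
            for i in range(1, len(word) + 1):
                index[word[:i]].add(rec["id"])

    def reverse_search(prefix):
        ids = index.get(prefix.lower(), set())
        return [r for r in records if r["id"] in ids]

    print(reverse_search("chem"))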
I use screen scraping daily. I run some eCommerce sites and have screen-scraping scripts running daily to gather product lists automatically from my suppliers' wholesale sites. This lets me keep up-to-date information on all the products available to me from several suppliers and flag uneconomical margins due to price changes.