Chat/conversation database - database

For a personal interest I try to define a simulated AI who is based on information that he learned and internet search in order to give more details than what the system know.
I took the example of a child, when he's born he need to learn everything, he heard a lot and then propose some answers. His mom/dad tell him if the answer are suitable or not.
In order to do that I wanted to stock a lot of chat conversations in an hadoop system and parse all of those conversation in order to determine which are the most frequent answer given. With that I want to construct a neuronal database who contains conversations types with the determined answers.
So my question is can I find somewhere legally on the internet one or more chat/conversation database in any format? (file, database, csv, ...)
The most data I have the best my chance are to be able to determine correctly the answers ;)
Thanks for the help and cheers,
Frédéric
PS: English is not my mother tongue

There is a collection of conversational datasets. Most of them are collected from publicly available sources. For you the most interesting ones could be the Santa Barbara corpus (although it's a transcript of speech conversations) or the movie dialog dataset.

Here is a fairly comprehensive collection of human-human and human-machine text dialogue datasets, as well as audio dialogue datasets.
https://breakend.github.io/DialogDatasets/

Credits goes to "Default picture"'s answer from above for the extensive library of Human-Human, Human-Machine sonversation resources at https://breakend.github.io/DialogDatasets/ including the Let’s Go dialogs from provided by Research Center at CMU https://github.com/DialRC/LetsGoDataset those resources are also used to train conversational agents at https://any.company/

The best way to have a Chat dataset is to generate on your own. You know what exactly you want. But IRC has some chat datasets that one of them has been used in this research.

Related

OpenAI GPT API pre-tokenizing?

I am trying to make a "personal assistant" chatbot (using GPT AI API) that can answer questions about myself when others ask it things. In order to do so, I have to give it a lot of information about myself, which I am currently doing in the prompt.
Example:
This means that every time someone asks a question, the prompt includes all of the information about me, which means that it gets tokenized every single time a question is asked. Is there a way to "pre-tokenize" the information about myself or store it in some other way? I ask this because the information about myself is what is costing me the most, as it sucks up a lot of tokens.
Thanks
I have tried looking online to no avail.
There are a couple of ways around this.
You could use the Playground to summarise your informaiton, but this might lose some of the meaning. This is the simplest approach.
Probably the most effective approach would be to fine-tune your own model that already has the data. This is more time consuming.
https://platform.openai.com/docs/guides/fine-tuning
This is from the OpenAI website:
Fine-tuned models.
Create your own custom models by fine-tuning our base models with your training data. Once you fine-tune a model, you’ll be billed only for the tokens you use in requests to that model.

Organisation of database for stackoverflow type website

I want to make a website on the lines of stackoverflow. I need some suggestions on how do I organise my databases....like what fields should there be for each user, how to organise the questions asked and the replies to them,the votes they get....if anyone can help me with the required database tables and respective fields in them. Thanks in advance :)
See for yourself:
https://data.stackexchange.com/stackoverflow/query/new
Or download and experiment with some of the SO clones:
http://www.osqa.net/
List of clones:
https://meta.stackexchange.com/questions/2267/stack-overflow-clones
im not an arrogant a-hole or anything, but that all depends on what you want to optimize for effeciency. You want us to tell you what a DBA would to do make a website like this.
The only knowledge i can part with my years of experience designing websites and as a software engineer.
Determine every variable and subasset your system will need. WHat will you require of the users, and need to maintain. AFterwards, get our some postit notes, and go to a whiteboard.
When you figure out your table schema, post it back here so we can help you optimize. IMO, you arent asking a question as asking for something for free.
It is a lengthy process. When YOU figure out how you want the tables, we can adjust all your varchars, and bits, and clobs to be the best.
Until then, I wish you luck.
Sidenote: I have had to do this from scratch with a team of 5 to figure out and design a project. Just sit down 1 day and determine every thing you want to accomplish and then as your sticky notes can be moved around, Do that to figure out what you think would go well together.

Access to the Music Genome Database

I'm interesting in finding songs based on attributes (minor key tonality, etc). These are things listed in the details of why Pandora picks songs, but using Pandora, I have to give it songs/artists.
Is there any way to get the Music Genome database (or something similar) so I can search for songs based on attributes (that someone else has already cataloged)
You can use Gracenote's Global Media Database and search with Track-level attributes.
"Gracenote's Media Technology Lab scientists and engineers take things further by utilizing technologies like Machine-Listening and Digital Signal Processing to create deep and detailed track level descriptors such as Mood and Tempo."
I don't think there is any way to access this proprietary data, something I asked them about long ago. It seems to me they want to protect this unique part of their system; after all, they've paid for the man hours to label each song. Even if Pandora releases a developer API, which they've hinted at, I doubt it will provide access to the Music Genome information.
Give Echo Nest a shot!
To add to above answers, Pandora's statement (as viewed using the above link in combination with the Internet Archive) was:
"A number of folks also asked about the prospect for an open API, to allow individual developers to start building on the platform. We're not there yet, but it's certainly food for thought."
Given that this was seven years ago, I think their decision is pretty clear.

Silverlight demo for bioinformatics

I am a beginner Silverlight programmer preparing for the interview in medical research company. Job sounds damn interesting and I would like to get there.
To show my skills and interest, I want to write a program related to the topic.
What would you suggest?
First ideas: simple statistical analysis of input data, image collections (for example, find HD DNA image and put it in Silverlight Deep Zoom), lab inventory program..
Have a look at http://research.microsoft.com/en-us/projects/bio/default.aspx the Microsoft Biology Foundation, part of Microsoft Research. Its code is OpenSource (sic) and you will find many applications there. The apps cover most of the basics, sequences, etc. and have some nice display tools.
Collection/Maintenance/Retrieval of data is very important for any organization. Try this tutorial:
Silverlight tutorial: Creating a data centric web app with DataGrid, LINQ, and WCF Web Service
You placed a "bioinformatics" tag on your question. Many bioinformatics companies consider Perl programming to be quite important.
I suggest that you perform a search on "bioinformatics Perl" and take a look at books and sites that are retrieved. Perhaps you could park yourself at a local bookstore and peruse some of those titles. Free Perl interpreters are available.
You do have a basic understanding of genetics, yes? Be very familiar with some of the terminology, so you won't have to stare off into space while you pick from genotype/phenotype or mRNA/RNA/DNA or recall what a codon is.
It wouldn't hurt to nose around PubMed and get a basic understanding of what genomes are out there and what statistical tests can be performed on them.
I like your statistics idea. Perhaps you could write a program that tells you whether to accept or reject a null hypothesis based on numbers you read in from a file. Or perhaps you could figure out how to use the statistics portion of Entrez Gene in PubMed.
Best wishes for the interview.

How to get book metadata?

My application needs to retrieve information about any published book based on a provided ISBN, title, or author. This is hardly a unique requirement---sites like Amazon.com, Chegg.com, and even software like Book Collector seem to be able to do this easily. But I have not been able to replicate it.
To clarify, I do not need to search the entire database of books---only a limited subset which have been inputted, as in a book collection. The database would simply allow me to tag the inputted books with the necessary metadata to enable search on that subset of books. So scale is not the issue here---getting the metadata is.
The options I have tried are:
Scrape Amazon. Scraping the regular Amazon pages was not very robust to things like missing authors, and while scraping the smaller mobile pages was faster, they shared the same issues with robustness of extraction. Plus, building this into an application is a clear violation of Amazon's Terms of Service.
Scrape the Library of Congress. While this seems to have fewer legal ramifications, ease and robustness were again issues.
ISBNdb.com API. While the service is free up to a point, and does a good job of returning the necessary metadata, I need to do this for over 500 books on a daily basis, at which point this service costs money proportional to use. I'd prefer a free or one-time payment solution that allows me to do the same.
Google Book Data API. While this seems to provide the information I need, I cannot display the book preview as their terms of service requires.
Buy a license to a database of books. For example, companies like Ingram or Baker & Taylor provide these catalogs to retailers and libraries. This solution is obviously expensive, so I'm hoping that there's a more elegant solution I've missed. But if not, and someone on SO has had a good experience with a particular database, I'm willing to go with that.
I've tried to describe my approach in detail so others with fewer books can take advantage of the above solutions. But given my requirements, I'm at my wits' end for retrieving book metadata.
Since it is unlikely that you have to retrieve the same 500 books every day: store the data retrieved from isbndb.com in a database and fill it up book by book.
Instead of scraping Amazon, you can use the API they expose for their affiliate program: https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
It allows about 3k requests per hour and returns well-formed XML. It requires you to set a link to the book that you show the information about, and you must state that you are an affiliate partner.
This might be what you're looking for. They even offer a complete download!
https://openlibrary.org/data
As it seems, a lot of libraries and other organisations make information such as "ISBN" available through MAchine-Readable Cataloging aka MARC, you can find more information about it here as well.
Now knowing the "right" term to search for I discovered WorldCat.org.
Maybe this whole MARC thing gives you a new kind of an idea :)

Resources