How to use Weka for predicting results - dataset

I'm new to Weka and I'm a bit confused by the tool. I have a data set of fruit prices and related attributes, and I'm trying to predict a specific fruit's price from it. Since I'm new to Weka, I couldn't figure out how to do this. Please help me or point me to a tutorial on making predictions, and on what the best method or algorithm for this task is.

If you want to save a trained classifier and load it later to predict, the following should help. Assuming you want to use the Weka GUI, you have to go through two steps:
First, use some pre-labelled data (your fruit prices data) to train a classifier. Make sure the data is in ARFF format. Note that since price is a numeric value, you need a scheme that can handle a numeric class (i.e. regression), such as LinearRegression or M5P. After training, save the model to disk.
More on this can be found here: https://waikato.github.io/weka-wiki/saving_and_loading_models/
In the second step, you use the already-trained model from step 1. Specifically, load the model file you saved, then choose the "Supplied test set" option on the "Classify" tab and select the unlabelled data as the test set.
More on this can be found here: https://waikato.github.io/weka-wiki/making_predictions/
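If you'd rather script these two steps than click through the GUI, the same workflow takes a few lines of the Weka Java API. A minimal sketch, assuming a numeric price attribute as the last (class) attribute and hypothetical file names:

    import weka.classifiers.Classifier;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FruitPricePrediction {
        public static void main(String[] args) throws Exception {
            // Step 1: train on labelled data and save the model.
            Instances train = DataSource.read("fruit-prices.arff"); // hypothetical file
            train.setClassIndex(train.numAttributes() - 1);         // price = last attribute
            Classifier model = new LinearRegression();              // numeric class -> regression
            model.buildClassifier(train);
            SerializationHelper.write("fruit.model", model);

            // Step 2: load the saved model and predict on unlabelled data.
            Classifier loaded = (Classifier) SerializationHelper.read("fruit.model");
            Instances unlabelled = DataSource.read("fruit-new.arff"); // same header, price unknown
            unlabelled.setClassIndex(unlabelled.numAttributes() - 1);
            for (int i = 0; i < unlabelled.numInstances(); i++) {
                double predictedPrice = loaded.classifyInstance(unlabelled.instance(i));
                System.out.println("Instance " + i + ": predicted price = " + predictedPrice);
            }
        }
    }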
I would suggest first playing around with the ARFF data files that come with your Weka install (they sit under your Weka installation directory; in my case, C:\Program Files\Weka-3-7\data).
Some more useful URLs:
https://developer.ibm.com/technologies/analytics/articles/os-weka1/
http://ortho.clmed.ncku.edu.tw/~emba/2006EMBA_MIS/3_16_2006/WekaIntro.pdf
Hope that helps.

I think the best step-by-step guide to getting predictions from an existing trained model can be found here: https://machinelearningmastery.com/save-machine-learning-model-make-predictions-weka/

Related

Which COBOL database do I have?

A simple but, admittedly, still weird question. I hope this is the right section; I couldn't find a decent answer anywhere on the internet.
First of all, it looks strongly like COBOL (ACUCOBOL?), but I am not sure.
I have binary files with the extensions .AC, .vix, and .SC, several MB each. Most of the files come in pairs, e.g. ADDRESSES.AC + ADDRESSES.vix or COMPANIES.SC + COMPANIES.vix.
In the middle of these files I can see fragments of records, but overall it is clearly a set of binary files.
There are no human-readable indexes, maps, dialect or configuration files, or headers of the kind I know exist in COBOL databases - nothing that can be parsed with normal text tools. There are no CPY, RDD, or XFD files either. Just files full of binary data, with fragments of records/IDs (?) here and there. So I can determine, for example, that one file contains a set of addresses, the next apparently sales, the next client data, etc.
Questions are:
How do I determine which COBOL database format I am using? (Mostly to obtain a proper tool to extract the data.)
How do I convert this database into something that can be parsed and moved elsewhere - even Excel?
I have no access to the computer that worked with this database; it went deep into the bin many years ago. Nothing else remains, just one folder with the database files.
Has anybody had the same problem?
How do I determine which COBOL database format I am using?
You aren't using a database but ISAM files, very likely ACUCOBOL-GT Vision files, file format 5. For details about the format, see the official documentation.
Mostly to obtain a proper tool to extract the data.
The proper tool would be vutil, and the command vutil -u -t ADDRESSES.AC ADDRESSES.TXT, which will present you with a text file that is very likely in fixed-length form (the variable form is relatively uncommon) -> step 1.
As the data likely contains binary fields, you have to investigate the data to work out the actual format/record layout -> step 2, and calculate the decimal values from the binary fields -> step 3.
But there are tools out there that help you with steps 2 and 3. I can recommend RecordEditor, where you can see the data, set field widths/types (defining a record layout, similar to the Excel import, but it also lets you use binary COBOL types), and convert the resulting file to CSV.
If you don't have access to vutil (or vutil32.exe on Windows), you may find someone who has access to this tool and can convert the data for you; or get an evaluation version (it would be an old download; the new product owner of ACUCOBOL-GT is Micro Focus, which only provides evaluation versions of its incompatible "Visual COBOL" product).
Alternatively you can reverse-engineer the format (the record layout is in the .vix file; open it with a hex editor and dive in), but this is likely a bigger task...
Summary:
1: convert the data to text format; vutil/vutil32.exe is the easiest way
2: investigate the files and work out the record layout (field widths, types)
3: load the file, convert the binary fields, export as CSV
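For step 3, the binary fields in COBOL data are typically packed decimal (COMP-3): two decimal digits per byte, with the final low nibble holding the sign (0xC/0xF positive or unsigned, 0xD negative). A minimal sketch of decoding such a field in Java, assuming you have already sliced the raw record bytes in step 2:

    public class PackedDecimal {
        // Decode a COBOL COMP-3 (packed decimal) field into a long.
        static long unpack(byte[] field) {
            long value = 0;
            for (int i = 0; i < field.length; i++) {
                int hi = (field[i] >> 4) & 0x0F;
                int lo = field[i] & 0x0F;
                if (i < field.length - 1) {
                    value = value * 100 + hi * 10 + lo;
                } else {
                    value = value * 10 + hi;          // last byte: one digit + sign nibble
                    if (lo == 0x0D) value = -value;   // 0xD marks a negative value
                }
            }
            return value;
        }

        public static void main(String[] args) {
            // 0x12 0x34 0x5C encodes +12345
            System.out.println(unpack(new byte[] {0x12, 0x34, 0x5C}));
        }
    }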
You definitely have Vision indexed data files, as you can see the matching .vix files; if a file has no .vix, it is a relative file with a set number of records.
If you have AcuBench, under the Tools menu there is an option for the Vision File Utility; from there you can unload your Vision data to a tab-delimited text file.
From there you can import it into Excel as a tab-delimited file and then re-save it as a CSV file.
So after all, I suppose this was an ISAM variant.
To untangle this, the following tools were needed:
First of all, some migration tool. In my case it was the ISMIGRATE GUI Wizard.
This tool comes with isCOBOL 2017 R1; you can find some free demos to download. Note that you don't need to install the whole package, just this migration tool.
Then you can use the ctree2 -> jisam conversion, or just try all the available options (not every one works, because some depend on paid libraries).
After the conversion you'll end up with a mostly readable, column-aligned text file.
In worse cases there will be some special ASCII control characters, but you can get rid of them with tools like Notepad++ or even Excel: search for them by hex code and replace each with a space (note that a space replaces exactly one character, which preserves the column alignment).
Note that you can also use the ASCII text import feature of MS Access/MS Excel. It is really helpful.
To position everything correctly, cut the file up, and make all the adjustments (and export to e.g. CSV), you can use http://record-editor.sourceforge.net, which is free. Note that after several trials I've found that other tools, even paid ones, won't really help you; the hard part is the first point, the conversion.
To be sure that everything works fine, you can even load the result into MS Access or similar to create foreign keys and reverse-engineer the whole database. With a working preview it will be easy to do the same at a larger scale, e.g. in PostgreSQL/Oracle.
That's it. I hope it'll be useful for somebody.
What was UNSUCCESSFUL:
Setting up an Actian Vector server; it is a really great and free tool, but it won't help you significantly here
Trying some online tools (besides, who knows where the data would be sent)
Any other text/ASCII editors, because in my case many of them crashed, I suppose due to the file sizes and some control characters (?)

Create a relational GWAS/Genomics database

I thought I'd ask before I try to build something from scratch.
Here is the type of problem I need to answer. One of our researchers comes to me and says "How many people in our data have such-and-such SNP genotyped?"
Our genetics data consists of several dozen GWAS files, typically flat delimited. Each GWAS file has between 100,000-1,000,000 SNPs. There is some overlap in the SNPs, but less than I'd originally thought.
Anyway, what I want to do is have some sort of structured database that links our participant IDs to a particular GWAS study, and then links that GWAS study to its list of SNPs, so that I can write a query that pulls all IDs that have the data. At no point do I need individual-level genotype data; it is far easier to pull the SNPs/samples I need once I know where they are.
So that is my problem and what I'm looking for. For anyone who works with a lot of GWAS data, I'm sure you're familiar with the problem. Is there anything (free or paid) that is built for this type of problem? Or do you have thoughts on what direction I might want to go if I need to build this myself?
Thanks.
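For what it's worth, the structure described above maps cleanly onto two relational tables: one linking participants to studies, one linking studies to genotyped SNPs. A minimal sketch using JDBC with SQLite (all table, column, and file names are just illustrations, and the sqlite-jdbc driver is assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class GwasCatalog {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:sqlite:gwas_catalog.db");
            Statement st = conn.createStatement();
            // Which participants are in which GWAS study.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS study_participants ("
                    + "study_id TEXT, participant_id TEXT)");
            // Which SNPs each study genotyped (no individual-level genotypes needed).
            st.executeUpdate("CREATE TABLE IF NOT EXISTS study_snps ("
                    + "study_id TEXT, rsid TEXT)");
            // "How many people in our data have such-and-such SNP genotyped?"
            ResultSet rs = st.executeQuery(
                    "SELECT COUNT(DISTINCT p.participant_id) "
                  + "FROM study_participants p "
                  + "JOIN study_snps s ON p.study_id = s.study_id "
                  + "WHERE s.rsid = 'rs12345'");
            if (rs.next()) System.out.println("Participants with rs12345: " + rs.getInt(1));
            conn.close();
        }
    }

Loading the catalog is then just a one-time pass over each GWAS file's SNP column into study_snps, which stays small compared to the genotype data itself.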

What type of database for storing ML experiments

So I'm thinking of writing a small piece of software that runs ML experiments on a cluster, or on an arbitrary abstracted executor, and then saves the results so that I can view them in real time efficiently. The executor software will have write access to the database and will push metrics live. I have not worked much with databases, so I'm not sure what the correct approach is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code (or an archive of code) that can be executed on the remote machine. For now we will assume all dependencies etc. are installed there. The code will accept command-line arguments. The experiment will also include a YAML schema defining those command-line arguments. The code itself will specify what gets logged (e.g., I will provide a library for registering channels). In terms of logging, you can log numerical values, arrays, text, etc., so quite a few types. Each channel will have a single specification (e.g., two columns: first an int iteration, second a float error). The code will also dump a copy of the parameters at the end of the experiment.
When one submits an experiment, one provides its unique group name plus the parameters for execution. This launches the experiment and logs everything.
For me, the easiest implementation would be a flat file system. Each project gets a unique name. Each new experiment gets a unique ID and a folder inside the project, where I can store the code. Each channel gets a file, which for simplicity can be CSV, plus a schema file describing what types of values are stored there so I can load them back. The final parameters can also be copied into the folder.
However, because of the variety of ways I could do this, and the fact that this might require a separate "table" per experiment, I have no idea whether this maps onto any database system. Also, maybe I'm overlooking something obvious, so if you have any experience with this, any suggestions/advice are most welcome. The end goal is to serve all of this to a web interface. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those work)?
The data for ML would primarily be unstructured. That kind of data does not fit naturally into an RDBMS; a document database like MongoDB is far better suited to such cases.
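To make that concrete, here is a sketch of one experiment stored as a single document, using the MongoDB Java driver (all field names are just illustrations of the channel/params structure described in the question):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.Arrays;

    public class ExperimentStore {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> experiments =
                    client.getDatabase("ml").getCollection("experiments");
            // One experiment = one document; each channel carries its own schema,
            // so no per-experiment table is needed.
            Document exp = new Document("project", "demo-project")
                    .append("experimentId", "exp-0001")
                    .append("params", new Document("lr", 0.01).append("epochs", 10))
                    .append("channels", Arrays.asList(
                            new Document("name", "training_error")
                                    .append("schema", Arrays.asList("iteration:int", "error:float"))
                                    .append("rows", Arrays.asList(
                                            Arrays.asList(1, 0.91),
                                            Arrays.asList(2, 0.74)))));
            experiments.insertOne(exp);
            client.close();
        }
    }

Live metrics can then be appended to a channel's rows array with an update using the $push operator, which suits the "executor pushes metrics live" requirement.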

drink calculator for flash

My charity organisation has asked me to produce an alcohol units calculator just like the one illustrated on the website http://www.drinkaware.co.uk/tips-and-tools/drink-diary/.
I am going to use Flash with ActionScript. As I am new to ActionScript, my questions are:
Will I need a database to link the alcoholic beverages to in the drop-down menus?
If so, how is this done, or is there an easier way to achieve it?
Any tutorial links or advice would be very much appreciated.
Thanks
Peter
You may not necessarily need a database, as there are a number of ways this could be done. If you're interested in keeping the data outside the Flash file, you could use a database.
Connecting Flash to an external database can get pretty complex though, and there are often better options, particularly when the project is fairly small like this one.
I think XML is a nice option for your data in this case. You could store all the drink data in an XML file and then write AS code to populate the combo boxes with data from the XML.
This will help with creating the drop-down menus (ComboBoxes): http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/fl/controls/ComboBox.html
And here's some more info on using XML to store your data:
http://help.adobe.com/en_US/as3/dev/WS5b3ccc516d4fbf351e63e3d118a9b90204-7e6a.html
http://help.adobe.com/en_US/as3/dev/WS5b3ccc516d4fbf351e63e3d118a9b90204-7e71.html
It could all be coded into the Flash piece also, without using a database or XML, but that's generally not considered good practice since it makes updating the data somewhat more difficult.
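To make the XML route concrete: the UK unit formula is units = ABV(%) x volume(ml) / 1000, so the data file only needs a name, strength, and serving size per drink. A small sketch of the idea in Java rather than ActionScript (the parsing calls differ in AS3, but the XML shape and the calculation carry over; the file and tag names are made up):

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    import java.io.File;

    public class DrinkCalculator {
        public static void main(String[] args) throws Exception {
            // drinks.xml (made-up format):
            // <drinks>
            //   <drink name="Lager (pint)" abv="4.0" volumeMl="568"/>
            //   <drink name="Red wine (glass)" abv="13.0" volumeMl="175"/>
            // </drinks>
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new File("drinks.xml"));
            NodeList drinks = doc.getElementsByTagName("drink");
            for (int i = 0; i < drinks.getLength(); i++) {
                Element d = (Element) drinks.item(i);
                double abv = Double.parseDouble(d.getAttribute("abv"));
                double volumeMl = Double.parseDouble(d.getAttribute("volumeMl"));
                double units = abv * volumeMl / 1000.0; // UK alcohol units
                System.out.printf("%s: %.1f units%n", d.getAttribute("name"), units);
            }
        }
    }

Adding a new drink is then just a new line in the XML file, with no change to the compiled Flash piece.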

How to rank genes using information gain?

How is gene ranking done for microarray data using information gain and chi-squared statistics? Please illustrate with a simple example.
You could use the open-source machine learning software Weka. Load your dataset and go to the "Select attributes" tab. Use the following attribute evaluators:
ChiSquaredAttributeEval : Evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class.
InfoGainAttributeEval : Evaluates the worth of an attribute by measuring the information gain with respect to the class.
...using Ranker as the "Search Method". That way the attributes are ranked by their individual evaluations.
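The same ranking can be done programmatically; information gain here is IG(class, gene) = H(class) - H(class | gene), computed per attribute. A minimal sketch with the Weka Java API, assuming the class label is the last attribute of a hypothetical microarray.arff:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class GeneRanking {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("microarray.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);        // class = last attribute

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval()); // or ChiSquaredAttributeEval
            selector.setSearch(new Ranker());                   // rank all attributes
            selector.SelectAttributes(data);

            // Each row holds an attribute index and its information gain, best first.
            for (double[] r : selector.rankedAttributes()) {
                System.out.println(data.attribute((int) r[0]).name() + "\t" + r[1]);
            }
        }
    }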
I don't exactly understand your question, but a very successful package for analyzing microarray data can be found here:
BioConductor
This is a software project that has a variety of different modules for reading data from microarrays and performing statistical analysis. This is very useful, because the file formats for microarray data are constantly changing as the technology develops, and the algorithms for analyzing microarray data have advanced significantly as well.
You can use InfoGainAttributeEval to calculate information gain, and for more information check this answer.
