I am using RapidMiner 7 to run unsupervised learning algorithms for my research. My requirements demand that I save the clusters (results) in a text file. I found some methods used in RapidMiner 5, but they are no longer available in the current version. Is there any way I can do this in RapidMiner 7?
You can save the cluster models using Write Clustering or Write Model. These create XML output. You can also use Write as Text to write a simplified summary of the cluster model.
If you want to save the example set containing the allocation of examples to clusters, you can use Write CSV.
A simple but, admittedly, still weird question. I hope this is the right section; I couldn't find a decent answer anywhere on the internet.
First of all, it looks very much like COBOL (ACUCOBOL?), but I am not sure.
I have binary files with the extensions .AC, .vix, and .SC, several MB each. Most of the files come in pairs, e.g. ADDRESSES.AC + ADDRESSES.vix or COMPANIES.SC + COMPANIES.vix.
In the middle of these files I can see fragments of records, but otherwise they appear to be a set of binary files.
There are no human-readable indexes, maps, dialects, configuration files, or headers of the kind I know exist in COBOL databases - nothing that can be parsed with normal text tools. No CPY, RDD, or XFD files either. Just files with a lot of binary data and fragments of records/IDs (?) here and there. So I can determine, e.g., that one file contains a set of addresses, the next apparently sales, the next client data, etc.
My questions are:
How to determine which version of COBOL database am I using? (Mostly to obtain a proper tool to extract the data.)
How to convert this database to something that can be parsed and moved to whatever else - even Excel?
I have no access to the computer that worked with this database, as it has been deep in the litter bin for many years; nothing else remains, just one folder with the database files.
Has anybody had the same problem?
Here is a sample:
How to determine which version of COBOL database am I using?
You aren't using a database but ISAM files, very likely ACUCOBOL-GT file format 5. For details about the format, see the official documentation.
Mostly to obtain a proper tool to extract the data.
The proper tool would be vutil and the command vutil -u -t ADDRESSES.AC ADDRESSES.TXT, which will present you with a text file that is very likely in fixed-length form (the variable form is relatively uncommon) -> step 1.
As the data likely contains binary fields, you have to investigate the data to work out the actual format/record layout -> step 2, and calculate the decimal values from the binary fields -> step 3.
But there are tools out there that help you with steps 2 and 3. I can recommend RecordEditor, where you can see the data, set field widths/types (defining a record layout, similar to an Excel import, but one that also lets you use binary COBOL types), and convert the resulting file to CSV.
If you don't have access to vutil (or vutil32.exe on Windows), you may find someone who has access to this tool and can convert the data for you; or get an evaluation version (it would be an old download; the current owner of ACUCOBOL-GT is Micro Focus, which only provides evaluation versions of its incompatible "Visual COBOL" product).
Alternatively, you can reverse-engineer the format (the record layout is in the vix file; open it with a hex editor and dive in), but this is likely a bigger task...
Summary:
decide how to do step 1; vutil/vutil32.exe is the easiest way
1: convert the data to text format
2: investigate the files to work out the record layout (field widths, types)
3: load the file, convert the binary fields, and export as CSV (see the sketch below)
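For step 3, here is a minimal Python sketch of decoding such binary fields; the field offsets and the sample record are made up for illustration, and your real layout is whatever comes out of step 2:
import struct

def unpack_comp3(data: bytes) -> int:
    # COBOL COMP-3 (packed decimal): two digits per byte,
    # sign in the low nibble of the last byte (0x0D = negative).
    digits = []
    for b in data[:-1]:
        digits.extend((b >> 4, b & 0x0F))
    digits.append(data[-1] >> 4)
    value = int("".join(str(d) for d in digits))
    return -value if (data[-1] & 0x0F) == 0x0D else value

record = bytes.fromhex("0001e24000123d")       # hypothetical 7-byte record
(comp,) = struct.unpack(">i", record[0:4])     # COMP: big-endian binary -> 123456
comp3 = unpack_comp3(record[4:7])              # COMP-3: packed decimal -> -123
print(comp, comp3)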
You definitely have Vision indexed data files, as you can see the matching .vix files; if a file has no .vix counterpart, it is a relative file with a set number of records.
If you have AcuBench, under the Tools menu there is an option for the Vision File Utility; from there you can unload your Vision data to a tab-delimited text file.
From there you can import it into Excel as a tab-delimited file and then re-save it as a CSV file (see the sketch below for a scripted alternative).
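If you'd rather skip the Excel round trip, the tab-to-CSV conversion is trivial to script; a minimal Python sketch (the file names are placeholders):
import csv

# Convert the tab-delimited unload directly to CSV.
with open("ADDRESSES.TXT", newline="") as src, \
     open("ADDRESSES.csv", "w", newline="") as dst:
    csv.writer(dst).writerows(csv.reader(src, delimiter="\t"))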
So, after all, I suppose this was an ISAM version.
To untangle this, the following tools were needed:
First of all, some migration tool. In my case it was the ISMIGRATE GUI Wizard:
This package comes with isCOBOL 2017 R1; you can find some free demos to download. Note that you don't need to install the whole package, just this migration tool.
Then you can use the ctree2 -> jisam conversion, or just try all the available options (not every one is available, because some of the required libraries are paid).
After conversion you'll end up with something like this:
In worse cases there will be some special ASCII characters, but you can get rid of them using tools like Notepad++ or even Excel: search for them by hex code and replace each with a space (note that the space replaces exactly one character, which preserves the column alignment).
Note that you can also use the special ASCII text import functions of MS Access/MS Excel. They are really helpful.
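If the editors choke on the file size, the same hex-code-to-space replacement can be scripted; a minimal Python sketch (the file names are placeholders):
# Replace every non-printable byte with a single space so the
# fixed-width column alignment is preserved. Tabs and newlines are kept.
table = bytes(b if 0x20 <= b < 0x7F or b in (0x09, 0x0A, 0x0D) else 0x20
              for b in range(256))

with open("converted.dat", "rb") as src, open("cleaned.txt", "wb") as dst:
    dst.write(src.read().translate(table))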
To position everything correctly, cut this file, and make all the adjustments (and export to e.g. CSV), you can use http://record-editor.sourceforge.net,
which is free. Note that after several trials I've noticed that other tools, even paid ones, won't really help you. The problem is in the first point: the conversion.
To be sure that everything works fine, you can even load the result into MS Access or similar to see how to create foreign keys and reverse-engineer the whole database. With a working preview, it will be easy to do the same on a larger scale, e.g. in PostgreSQL/Oracle.
That's it. I hope it'll be useful for somebody.
What was UNSUCCESSFUL:
Setting up an Actian Vector server; it is a really great and free tool, but it won't help you significantly here
Trying some online tools (besides, who knows where the data will be sent)
Any other ASCII editors; in my case many of them crashed, I suppose because of the size of the files and because of some control characters (?)
The situation
I use LabVIEW 2012 on Windows 7
My test result data is written to text files. First, information about the test is written to the file (product type, test type, test conditions, etc.), and after that the logged data is written each second.
All data files are stored in folders sorted by date, and the file names contain some information about the test
I have years' worth of data files, and my search function currently works only on the file names (opening each file to look for search terms costs too much time)
The goal
To write metadata (additional properties, like Word files can have) with the text files so that I can implement a search function to quickly find the file that I need
I found here a way to write/read metadata for images, but I need it for text files or something similar.
You would need to be writing to data files that support metadata to begin with (such as the LabVIEW TDMS or datalog file formats). In a similar situation, I would simply use a separate file with the same name but a different extension, for example. Then you can index those file names, and if you want the data you just swap the metadata filename extension and you are good to go.
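A minimal sketch of that sidecar-file pattern, shown in Python for brevity (the .meta extension, the JSON format, and the field names are my assumptions; the same idea is straightforward to build in LabVIEW):
import json
from pathlib import Path

def meta_path(data_file: str) -> Path:
    # Same name, different extension: results_0042.txt -> results_0042.meta
    return Path(data_file).with_suffix(".meta")

def write_meta(data_file: str, **fields) -> None:
    meta_path(data_file).write_text(json.dumps(fields))

def search(folder: str, **wanted):
    # Index only the small .meta files; open a data file only on a hit.
    for m in Path(folder).rglob("*.meta"):
        meta = json.loads(m.read_text())
        if all(meta.get(k) == v for k, v in wanted.items()):
            yield m.with_suffix(".txt")  # swap back to the data file

write_meta("results_0042.txt", product="X100", test="burn-in")
print(list(search(".", product="X100")))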
I would not bother with files and would use a database for results logging. It may not be what you are willing to do, but this is the ultimate solution to the search problem, and it opens up a lot of data-analytics possibilities.
The metadata in Word files comes from a feature called "Alternate Data Streams", which is actually a function of NTFS. You can learn more about it here.
I can't say I've ever used this feature. I don't think there is a nice API for LabVIEW, but one could certainly be made. With some research you should be able to play around with this feature and see if it really makes finding files any easier. My understanding is that the data can be lost if the file is transferred over the network or onto a non-NTFS thumb drive.
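You can experiment with Alternate Data Streams from any language that goes through the Win32 file APIs, because a stream is addressed as filename:streamname; a minimal Python sketch (the file and stream names are placeholders, and this only works on an NTFS volume under Windows):
# Write and read an NTFS Alternate Data Stream; ":meta" is an
# arbitrary stream name attached to the same file.
with open("results_0042.txt", "w") as f:
    f.write("main data")

with open("results_0042.txt:meta", "w") as f:   # hidden side stream
    f.write("product=X100;test=burn-in")

with open("results_0042.txt:meta") as f:
    print(f.read())  # product=X100;test=burn-in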
I am a researcher and my primary interest is improving sparse kernels for high-performance computing. I investigate a large number of parameters on many sparse matrices, and I wonder whether there is a tool to manage these results. The problems that I encounter are:
Combining the results of several experiments for each matrix
Versioning the results
Taking the average and finding the minimum/maximum/standard deviation of the results
There are hundreds of metrics that describe the performance improvement. I want to easily select a couple of the metrics and try to find which ones correlate with the performance improvement.
Here I give a small sample instance of my huge problem. There are three types of parameters, with two values for each parameter: Row/Column, Cyclic/Block, HeuristicA/HeuristicB. So there must be 8 files, one for each combination of these parameters. Here are the contents of two of them:
Contents of the file RowCyclicHeuristicA.txt
a.mtx#3#5.1#10#2%#row#cyclic#heuristicA#1
a.mtx#7#4.1#10#4%#row#cyclic#heuristicA#2
b.mtx#4#6.1#10#3%#row#cyclic#heuristicA#1
b.mtx#12#5.7#10#7%#row#cyclic#heuristicA#2
b.mtx#9#3.1#10#10%#row#cyclic#heuristicA#3
Contents of the file ColumnCyclicHeuristicA.txt
a.mtx#3#5.1#10#5%#column#cyclic#heuristicA#1
a.mtx#1#5.3#10#6%#column#cyclic#heuristicA#2
b.mtx#4#7.1#10#5%#column#cyclic#heuristicA#1
b.mtx#3#5.7#10#9%#column#cyclic#heuristicA#2
b.mtx#5#4.1#10#3%#column#cyclic#heuristicA#3
I have a scheme file that describes the contents of these files. It has one line giving the type and meaning of each column in the result files:
str MatrixName
int Speedup
double Time
int RepetationCount
double Imbalance
str Parameter1
str Parameter2
str Parameter3
int ExperimentId
I need to display the average Time against two types of parameters as follows (the numbers in the following table are random):
          Parameter1        Parameter2
Matrix    row     col       cyclic    block
a.mtx     4.3     5.2       4.2       5.4
b.mtx     2.1     6.3       8.4       3.3
Is there an advanced, sophisticated tool that takes the scheme of the table above and generates this table automatically? Currently I have a tool written in Java to process the raw files, plus LaTeX code to manipulate and display the table using pgfplotstable. However, I need one tool that is more professional. I do not want MS Excel pivot tables.
A similar question is here.
Manipulating large amounts of data in an unknown format is...challenging for a generic program. Your best bet is probably similar to what you're doing already. Use a custom program to reformat your results into something easier to handle (backend), and a visualisation program of your choice to let you view and play around with the data (frontend).
Backend
For your problem I'd suggest a relational database (e.g. MySQL). It has a longer setup time than the other options, but if this is an ongoing problem it should be worthwhile, as it allows you to easily pull out the fields of interest.
SELECT AVG(Speedup) FROM results WHERE Parameter1="column" AND Parameter2="cyclic", for example. You'll then still need a simple script to insert your data in the first place, and then to pull the results of interest into a useful format you can stick into your viewer. Or, if you so desire, you can just run queries directly against the DB.
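A minimal sketch of such a script, using Python's built-in sqlite3 in place of MySQL to avoid server setup (the file pattern and table layout follow the question's scheme file):
import csv, glob, sqlite3

conn = sqlite3.connect("results.db")
conn.execute("""CREATE TABLE IF NOT EXISTS results (
    MatrixName TEXT, Speedup INTEGER, Time REAL, RepetationCount INTEGER,
    Imbalance TEXT, Parameter1 TEXT, Parameter2 TEXT, Parameter3 TEXT,
    ExperimentId INTEGER)""")

for path in glob.glob("*Heuristic*.txt"):          # the 8 result files
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter="#"))  # '#'-separated fields
    conn.executemany("INSERT INTO results VALUES (?,?,?,?,?,?,?,?,?)", rows)
conn.commit()

avg = conn.execute("""SELECT AVG(Time) FROM results
                      WHERE Parameter1='column' AND Parameter2='cyclic'""").fetchone()
print(avg)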
Alternatively, what I usually use is just Python or Perl: read in your data files, strip the data you don't want, rearrange it into the desired structure, and write it out in some standard format your frontend can use. Replace Python/Perl with the language of your choice.
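For this particular table, the Python route can be very short with pandas; a minimal sketch (the column names come from your scheme file, and the pivot reproduces the averaged-Time layout shown in the question):
import glob
import pandas as pd

cols = ["MatrixName", "Speedup", "Time", "RepetationCount", "Imbalance",
        "Parameter1", "Parameter2", "Parameter3", "ExperimentId"]

frames = [pd.read_csv(path, sep="#", names=cols)
          for path in glob.glob("*Heuristic*.txt")]
df = pd.concat(frames, ignore_index=True)

# Average Time per matrix for each value of Parameter1 and Parameter2,
# placed side by side like the table above.
p1 = df.pivot_table(values="Time", index="MatrixName", columns="Parameter1", aggfunc="mean")
p2 = df.pivot_table(values="Time", index="MatrixName", columns="Parameter2", aggfunc="mean")
print(pd.concat([p1, p2], axis=1))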
Frontend
Personally, I almost always use Excel. The backend does most of the heavy lifting, so I get a CSV file with the results I care about, all nicely ordered already. Excel then lets me play around with the data, doing things like taking averages, plotting, reordering, etc. fairly simply.
Other tools I use to display data, which are probably not useful for you but are included for completeness:
Weka - Mostly targeted at machine learning, but provides tools for searching for trends or correlations. Useful for playing around with data, looking for things of interest.
Python/IDL/etc - For when I need data that can't be represented by a spreadsheet. These programs can, in addition to doing the backend's job of extracting and bulk manipulations, generate difference images, complicated graphs, or whatever else I need.
The program I work on has several shapefiles, with quite a few attributes. At the moment they are stored in our version control (Subversion) as compressed blobs (dbf.gz, shp.gz and shx.gz). This is how they are used by the program, but it's extremely inconvenient for versioning purposes. We get no information about changes to entries, or attributes - just that something, somewhere in the file has changed. No useful diff.
The DBF is the one that has the attributes. I was thinking maybe we could store it as CSV and then, as part of the build process, convert it to DBF and do ??? (to be determined) to make it a valid shapefile, then build the zipped version the program currently uses.
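A minimal sketch of what such a build step might look like in Python with the pyshp package (the file names and the ID/NAME field layout are my assumptions; you would tailor the field types to your actual attributes):
import csv
import shapefile  # pyshp

# Re-attach versioned CSV attributes to the existing geometry and
# rebuild a complete, valid shapefile as a build step.
r = shapefile.Reader("roads")                 # existing geometry
w = shapefile.Writer("build/roads", shapeType=r.shapeType)
w.field("ID", "N")                            # assumed attribute columns
w.field("NAME", "C", size=40)

with open("roads_attributes.csv", newline="") as f:
    rows = list(csv.DictReader(f))            # one CSV row per shape, in order

for shape, row in zip(r.shapes(), rows):
    w.shape(shape)
    w.record(int(row["ID"]), row["NAME"])
w.close()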
Another approach might be to remove nearly all the attributes from the shapefile, store those in CSV/YAML/whatever (which can be versioned nicely), and either look them up by the shape IDs or try to attach them to our objects after they have been instantiated from shapefiles, something like that.
But maybe folks with more experience with shapefiles have better ideas?
The DBF you refer to at the start of your second paragraph has the attributes. Why not dump the table out on a "per shape" basis to an XML-style file and use THAT for Subversion? If you are actually working within Visual FoxPro (which also uses DBF-style files), you could use the function CursorToXML() and just run it in a loop over the distinct shapes, dumping each to its respective XML file. Then, when reading it back in, use XMLToCursor() on each per-shape file.
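Outside FoxPro, the same per-shape XML dump can be scripted; a minimal Python sketch using the dbfread package (the SHAPE_ID grouping column is an assumption about your attribute table):
from collections import defaultdict
from xml.etree import ElementTree as ET
from dbfread import DBF

groups = defaultdict(list)
for rec in DBF("roads.dbf"):               # each record is an ordered dict
    groups[rec["SHAPE_ID"]].append(rec)    # assumed per-shape key column

for shape_id, recs in groups.items():
    root = ET.Element("shape", id=str(shape_id))
    for rec in recs:
        row = ET.SubElement(root, "record")
        for name, value in rec.items():
            ET.SubElement(row, name).text = str(value)
    ET.ElementTree(root).write(f"shape_{shape_id}.xml")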
In a Windows Phone 7 application, I would like to query a big XML file (a list of cities) stored using Isolated Storage. If I do it this way, will the whole file be loaded into memory (> 5 MB)? If so, what other solutions do I have?
Edit:
More details: I want to use AutoCompleteBox (http://www.jeff.wilcox.name/2008/10/introducing-autocompletebox/), but instead of using a web service (this is fixed data, there is no need to be online), I want to query a file/database/isolated storage... I have a fixed list of cities. I said in the comments it was 40k rows, but it actually seems closer to 1k.
Instead of using isolated storage for this, would it be an option for you to use a web service instead... or are you designing your app for an offline approach?
Querying a web service (WCF or a JSON-enabled web service) is really simple, and it will be easier for you to maintain :)
Rather than having one big file containing all the data, can you not break it down into lots of smaller files (one for each city)?
You could have a separate file to keep an index of them all if need be. Alternatively, depending on the naming of the files, you may be able to use IsolatedStorageFile.GetFileNames to get a list of all files.
I would create my own file format, using, for example, a separator between fields, with one row for each record.
That way you can read your file line-by-line to fill your data structure with these advantages:
no need to pull the whole file into memory
no XML overhead (in a desktop application this may not be a problem, but in the phone context a 5 MB XML file can become quite a bit smaller as plain text)
Dumb example:
New York City; 12345
Berlin; 25635
...
EDIT: given that the volume is not that large, you don't need any form of indexing or on-demand loading. I would store the cities as stated above - one record per line - load them into a list, and use LINQ to select the items you need. This will probably be fast and will keep your application very responsive.
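A minimal sketch of that load-and-filter approach, shown in Python for brevity (on the phone the same pattern is a StreamReader loop plus a LINQ Where over the list; the file name and record format follow the example above):
# Load "City; ZIP" records, one per line, then filter by typed prefix
# the way an autocomplete box would.
cities = []
with open("cities.txt", encoding="utf-8") as f:
    for line in f:                      # one record per line, no XML overhead
        name, _, zip_code = line.partition(";")
        cities.append((name.strip(), zip_code.strip()))

def suggest(prefix: str, limit: int = 10):
    p = prefix.lower()
    return [c for c in cities if c[0].lower().startswith(p)][:limit]

print(suggest("new"))  # e.g. [('New York City', '12345')]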
In this case, in my opinion, XML is not the best tool for the job. Your structure is very simple, and storing it in XML would probably double the file size, which is a concern on a mobile device, and would also slow the parsing, another concern in this case.