Is there a way to read a file with inconsistent data using MFL in OSB 11g?

Recently I got a requirement to read a file and insert its records into a DB. But when I looked at the file, the records are not consistent (some rows have fewer fields than others), and the source team is not in a position to alter it in any way. So, is there a way to read it?
Example of a File:
Record1,Record2,Record3,Record4
Record1,Record2,Record3,Record4
Record1,Record2
Record1,Record2,Record3,Record4
Record1,Record2,Record3,Record4
Record1,Record2
Record1,Record2,Record3,Record4
Any input will be appreciated.
Regards,
Vishnu.

If I understand correctly, you have a comma-separated list of values and each new line forms a dataset.
You can use MFL Format Builder with comma as the delimiter and generate a standardized XML document out of your data. Since some rows carry fewer fields than others, you should be able to mark the trailing fields as optional in Format Builder so that the shorter rows still parse.
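For illustration only, the XML generated from the rows above might look something like this (the element names are whatever you define in Format Builder; these ones are made up):

<Root>
    <Row>
        <Field1>Record1</Field1>
        <Field2>Record2</Field2>
        <Field3>Record3</Field3>
        <Field4>Record4</Field4>
    </Row>
    <Row>
        <Field1>Record1</Field1>
        <Field2>Record2</Field2>
    </Row>
</Root>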
This link has a good tutorial to get you started.

Related

CSV file not recognized as csv, reason nominal value not declared in header

I am trying to load a dataset in Weka. I have tried many fixes, such as converting to ARFF format, adjusting the commas, etc., but they all failed. Could any of you give me a working solution, or load this dataset in the correct format?
Here is a link to the dataset
Instead of using Weka's functionality for reading CSV files, you could use ADAMS (developed at the same university; I'm the lead developer) instead.
Download the adams-ml-app snapshot and then use the Weka Investigator to load/save the file:
Load it as ADAMS Spreadsheets (.csv, .csv.gz)
Save it as Arff data files (.arff, .arff.gz) or Simple ARFF data files (.arff, .arff.gz)
The Reviews column contains an erroneous 3.0M, which prevents it from becoming numeric.
If you want an introduction to the Weka Investigator, take a look at my talk from the Weka User Conference 2021: Taking Weka to the next level with ADAMS.
There are too many issues with lines in this file.
In line 23, I eliminated the odd-looking brackets.
I removed all single quotes (')
I eliminated all repeated double quotes ("")
In line 10474 the first two fields (before the number) didn't seem to be separated, so I added a comma.
This allowed the file to go through initial screening, but...
The file contains a lot of odd emojis. I started to eliminate them one by one, but there are clearly more of these than I wish to deal with.
Each time I got rid of one, it would read farther into the file, then stop at the next one.
If I just try to read the top of the file (the first 20 lines, before any of these problems appear), it reads fine.
My partial editing can be found here: https://www.dropbox.com/s/ij707mb23dt1jvz/googleplaystore3.csv?dl=0
I think if you clear up the remaining emojis the file should be usable.
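Rather than removing the emojis one by one, a crude shortcut is to strip every non-ASCII character in a single pass, e.g. with tr on Linux/macOS (assuming the file is named googleplaystore.csv; note this drops all non-ASCII text, not just emojis, so check the result):

LC_ALL=C tr -cd '\11\12\15\40-\176' < googleplaystore.csv > googleplaystore-clean.csv

The character set keeps tab, newline, carriage return, and printable ASCII; everything else is deleted.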

text file of all titles / topic titles in Freebase

I need a text file containing every title (the title of each topic/item), each on its own line.
How can I do this or make this if I have already downloaded a freebase rdf dump?
If possible, I also need a separate text file with each topic's/item's description, each description on its own line.
How can I do that?
I would greatly appreciate it if someone could help me make either of these files from a Freebase rdf dump.
Thanks in Advance!
Filter the RDF dump on the predicate/property ns:type.object.name. If you only want a particular language, also filter by that language e.g. #en.
EDIT: I missed the second part about descriptions being desired as well. Here's a three-part regex which will get you all the lines with:
English names
English descriptions
a type of /common/topic
Combining the three is left as an exercise for the reader.
zegrep $'\tns:(((type\\.object\\.name|common\\.topic\\.description)\t.*#en)|type\\.object\\.type\tns:common\\.topic)\\.$' freebase-rdf-2013-06-30-00-00.gz | gzip > freebase-rdf-2013-06-30-00-00-names-descriptions.gz
It seems to have a performance issue that I'll have to look at later. A simple grep of the entire file takes ~11 min on my laptop, but this has been running several times longer than that.
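One likely culprit is locale-aware regex matching; forcing the C locale often speeds up grep-family tools dramatically on large dumps (untested against this particular file):

LC_ALL=C zegrep $'\tns:(((type\\.object\\.name|common\\.topic\\.description)\t.*#en)|type\\.object\\.type\tns:common\\.topic)\\.$' freebase-rdf-2013-06-30-00-00.gz | gzip > freebase-rdf-2013-06-30-00-00-names-descriptions.gz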

How can I extract human-readable text from a code snippet?

I need to write a T-SQL query against a text column where some of the values are HTML or ASP.NET markup but include normal human-readable text. For example:
{\colortbl ;\red31\green73\blue125;\red0\green0\blue0;} \viewkind4\uc1\pard\ltrpar\lang1033\f0\fs22 All invoices to be emailed to Jack Jack.Marsman#brampton.ca
I don't need that markup; I need the real text. In this case I want to get just: All invoices to be emailed to Jack Jack.Marsman#brampton.ca
Any suggestions on how to go about extracting the text without getting the coding?
Short answer is that there is no easy standard way to do this. I'd try creating a CLR function, since this kind of parsing is easier in C# or VB.NET.
You can also try using a regex to strip out everything that's not human-readable.
Is all of your data in a format similar to what you've shown (your sample is actually RTF, not HTML)? If that's the case, then it comes down to calling SUBSTRING several times…
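For example, here is a minimal T-SQL sketch for the RTF sample above. It assumes the readable text always follows the last control word (here \fs22 followed by a space) and uses a hypothetical table dbo.Notes with a Body column; adjust the names to your schema:

SELECT LTRIM(SUBSTRING(Body,
       CHARINDEX('\fs22 ', Body) + 6,  -- skip past the '\fs22 ' control word
       LEN(Body)))                     -- a length past the end is fine
FROM dbo.Notes
WHERE Body LIKE '%\fs22 %';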

Reading data from an XML file using C

This is a follow-up to:
using xslt to create an xml file in c
<element1 type="type1" name="value1">
    <start play="no"/>
    <element2 aaa="AAA"/>
    <element2 bbb="BBB"/>
    <element3 ccc="CCC">
        <element4/><!-- play="no"/>-->
    </element3>
</element1>
Let's say I get this XML file; how do I read individual nodes? I mean, not all nodes are mandatory. Do I need to go through all nodes via libxml2 or something similar and read their values? Or can I use some sort of schema to define what my XML can look like? What is a better way of dealing with this problem?
A schema is never a bad idea; however, it won't help you read the XML as such. All a schema does, given that you validate the XML against it, is tell you whether the document follows whatever rules are in there.
For the rest of it, a quick search on here would have found this: How can libxml2 be used to parse data from XML?
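As a minimal sketch with libxml2 (assuming the document above is saved as input.xml; element and attribute names are taken from your sample), you walk the children of the root and check for whichever optional nodes you care about:

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main(void)
{
    xmlDocPtr doc = xmlReadFile("input.xml", NULL, 0);
    if (doc == NULL) {
        fprintf(stderr, "failed to parse input.xml\n");
        return 1;
    }

    xmlNodePtr root = xmlDocGetRootElement(doc); /* <element1> */

    for (xmlNodePtr cur = root->children; cur != NULL; cur = cur->next) {
        if (cur->type != XML_ELEMENT_NODE)
            continue; /* skip text and comment nodes */

        if (xmlStrcmp(cur->name, (const xmlChar *)"element2") == 0) {
            /* optional attribute: xmlGetProp returns NULL if absent */
            xmlChar *aaa = xmlGetProp(cur, (const xmlChar *)"aaa");
            if (aaa != NULL) {
                printf("element2 aaa=%s\n", aaa);
                xmlFree(aaa);
            }
        }
    }

    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}

Compile with gcc read.c $(xml2-config --cflags --libs). Nodes that are absent simply never show up in the traversal, so optional elements cost nothing beyond the checks you write.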

How to export data from an ancient SQL Anywhere?

I'm tasked with exporting data from an old application that is using SQL Anywhere, apparently version 5, maybe 5.6. I've never worked with this database before, so I'm not sure where to start. Does anybody have a hint?
I'd like to export it in more or less any text representation that then I can work with. Thanks.
I ended up exporting the data by using isql and these commands (where #{table} is each of the tables, a list I built manually):
SELECT * FROM #{table};
OUTPUT TO "C:\export\#{table}.csv" FORMAT ASCII DELIMITED BY ',' QUOTE '"' ALL;
SELECT * FROM #{table};
OUTPUT TO "C:\export\#{table}.txt" FORMAT TEXT;
I used the CSV to import the data itself and the TXT to pick up the names of the fields (only parsing the first line). The TXT can become rather huge if you have a lot of data.
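If you'd rather not build the table list by hand, the system catalog may help; this works in later SQL Anywhere versions, though I can't confirm the exact catalog layout for 5.x:

SELECT table_name FROM SYS.SYSTABLE WHERE table_type = 'BASE';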
Have a read of http://www.lansa.com/support/tips/t0220.htm
