What is the best way to validate and parse complex fields in Pentaho Kettle?

What is the best way to validate fields in a row and, if a field is invalid, correct it to the right form?
The simplest example would be a phone number field, which can come in various formats (111-111-1111, (111) 111-1111, etc.) that we would ideally want to validate and standardize to one form (let's say 1111111111). One way to do this is to use a Filter Rows step with a regex, or the Data Validator step. But these will only tell us which data is invalid; they won't actually reformat it for us. We could then use a Modified Java Script Value step and write a JS script to do this, but I am guessing there is a better way (or a built-in step that I haven't come across) to do these basic validations. Or is it recommended to just dump rows containing invalid fields into a separate CSV file and then use a script to parse them separately?

G'day,
I use the excellent Replace in string step to handle this circumstance.
You can cumulatively apply rules for removing bad characters from strings within the single step. It is really easy to use for single-character fixes like the ones you have described, and best of all, it also lets you search by regex, so in a single step you have documented your standardisation and produced the clean output.
In your case, I would create two rules to replace ( and ) with nothing. The - is a little trickier: you need a rule for each removal of a single character, so you would need to know the maximum number of - in a single data field, then add that many lines to your Replace in string step.
If this is unpalatable, consider the User Defined Java Expression step and a call to replace, e.g.: ( (t0 != null) ? t0.replace("-","") : t0 )
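A hedged variant of that expression, if you don't know the maximum number of hyphens: String.replaceAll takes a regex, so a single call can strip every non-digit in one pass (this assumes the incoming field is named t0, as in the snippet above):

( (t0 != null) ? t0.replaceAll("[^0-9]", "") : t0 )

This turns both 111-111-1111 and (111) 111-1111 into 1111111111.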
As I stated, each 'fix' is applied in sequential order. The In stream field is the input field name, whereas the Out stream field is left blank, instructing DI to modify the field in place. Here's a more complex example where I search for regexes and replace them with nothing, except for the case where I escape a " double-quote:
In stream field   Out stream field   use RegEx   Search                                          Replace with
sc_srcuri                            N           {Internal.Transformation.Filename.Directory}
re_s_sciname                         Y           ["]                                             \\"
re_s_sciname                         Y           .[\x08]
re_s_sciname                         Y           .[\x08]
re_s_sciname                         Y           .[\x08]
re_s_sciname                         Y           [*]
re_s_sciname                         Y           \s*$
re_s_sciname                         Y           ^\s*
Notice I am removing up to three 'backspace' control codes ([\x08]) from this particular string?


How to split a search string into parts, then check parts against a database

Here's what I'm dealing with:
We have a database of machines and their part lists are specified using strings. For example, one machine might be specified with the string &XXX&YYY-ZZZ, meaning the machine contains parts XXX and YYY and not ZZZ.
We use &XXX to specify that a part exists in a machine, and -XXX to specify that a part does not exist in a machine.
It's also possible that a part is not listed (i.e. not specified whether or not it exists in the machine). For example I might only have &XXX&YYY (ZZZ is not specified).
Additionally, the codes can be in any order, for example I might have &XXX&YYY-ZZZ or &XXX-ZZZ&YYY.
In order to search for machines, I get a string like this: &XXX-YYY/&YYY&ZZZ (/ is an OR operator), meaning "I want to find all machines that either a) contain XXX and do not contain YYY, or b) contain both YYY and ZZZ."
I'm having trouble parsing the string based on the variable ordering, possibility that parts may not be shown, and handling of the / operator. Note, we use Microsoft 365.
Looking for some suggestions!
When I search for &XXX-YYY/&YYY&ZZZ, I should return the following machines:
Machine         Result
&XXX-YYY&ZZZ    TRUE (because XXX exists and YYY does not exist)
&XXX-YYY-ZZZ    TRUE (because XXX exists and YYY does not exist)
&XXX&YYY&ZZZ    TRUE (because YYY exists and ZZZ exists)
&XXX&ZZZ        FALSE (because YYY is specified in the search, but this machine doesn't specify it)
&ZZZ&YYY        TRUE (showing that parts can be in any order)
You can try it in cell C2 with the following formula:
=LET(query, A2, queries, TEXTSPLIT(query,, "/"), input, B2:B7,
  qryNum, ROWS(queries),
  SPLIT, LAMBDA(txt, LET(str, SUBSTITUTE(SUBSTITUTE(txt, "&", ";1_"),
    "-", ";0_"), TEXTSPLIT(str,, ";", TRUE))),
  lkUps, DROP(REDUCE("", queries, LAMBDA(acc,qry, HSTACK(acc, SPLIT(qry)))),,1),
  MAP(input, LAMBDA(txt, LET(str, SPLIT(txt),
    out, REDUCE("", SEQUENCE(qryNum, 1), LAMBDA(acc,idx,
      LET(cols, INDEX(lkUps,,idx), qry, FILTER(cols, cols<>""),
        matches, SUM(N(ISNUMBER(XMATCH(str, qry)))),
        result, IF(ROWS(qry)=matches, 1, 0),
        IF(acc="", result, MAX(acc, result))))),
    IF(out=1, TRUE, FALSE))))
)
Assumptions:
String values (operation and part) should be unique, i.e. the case &XXX-YYY&XXX is not considered, because &XXX is duplicated.
Explanation
The main idea is to transform the input information in a way that lets us do comparisons at array level via XMATCH. The first thing to do is to identify each OR condition in the search string, because we need to test each one of them against the Input column. The name queries is an array with all the OR conditions.
We can transform the string inputs in a way that lets us split each string into an array. SPLIT is a user-defined LAMBDA function that does that:
LAMBDA(txt, LET(str, SUBSTITUTE(SUBSTITUTE(txt, "&", ";1_"), "-", ";0_"), TEXTSPLIT(str,, ";", TRUE)))
What it does is convert for example the input: &XXX-YYY&ZZZ into the following array:
1_XXX
0_YYY
1_ZZZ
We change the original operations &, - into 1, 0 just for convenience, but you could keep the original operation values; it is irrelevant for the calculation. It is important to set the fourth TEXTSPLIT input argument to TRUE to ensure no empty rows are generated, as shown below.
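To see why that flag matters: the SUBSTITUTE calls place a ; before the first token, so the split would otherwise produce a leading empty row. A quick check with an array constant:

=TEXTSPLIT(";1_XXX;0_YYY;1_ZZZ",, ";", TRUE)

returns exactly the three tokens above, whereas with FALSE the result would start with an empty row.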
The name lkUps is an array with all the OR conditions organized in one column per query, in the format we want. For example:
1_XXX 1_YYY
0_YYY 1_ZZZ
Note: For creating lkUps we use the DROP/REDUCE/HSTACK pattern; for more information about it, check the answer to the question how to transform a table in Excel from vertical to horizontal but with different length, provided by @DavidLeal.
Now we have all the elements we need to build the recurrence. We use MAP to iterate over all Input column values. For each element (txt), we transform it to the format of our convenience via the SPLIT user LAMBDA function and name it str.
We use the REDUCE function inside MAP to iterate over all columns of lkUps and check them against str. We use SEQUENCE(qryNum, 1) as the input of REDUCE to be able to iterate over each lkUps column (qry).
Now we are going to use the above variables in XMATCH and name the variable matches as follows:
SUM(N(ISNUMBER(XMATCH(str, qry))))
If all values from qry are found in str, then we have a match. Each found item contributes 1 to the SUM, otherwise 0; therefore, for the match case, the SUM equals the number of rows of qry.
Because we include in the XMATCH both parts and operations (1, 0), we ensure that not just the same parts are found, but also that their corresponding operations are the same. The order of the parts is not relevant; XMATCH takes care of that.
The REDUCE recurrence keeps the maximum value from the previous iteration (the previous OR condition): we need at least one match among all OR conditions. Therefore, once the recurrence finishes, a REDUCE result of 1 means at least one match was found. Finally, we transform the result into TRUE/FALSE.
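As a worked illustration of the counting logic (the array constants are mine, standing in for the str and qry names): the machine &XXX-YYY&ZZZ tested against the query &XXX-YYY evaluates as

=LET(str, {"1_XXX";"0_YYY";"1_ZZZ"}, qry, {"1_XXX";"0_YYY"},
  SUM(N(ISNUMBER(XMATCH(str, qry)))))

which returns 2. That equals ROWS(qry), so this OR condition matches and the machine yields TRUE.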
Note: For a large list of operations, instead of using the above approach of two nested SUBSTITUTE calls, the SPLIT function can be defined as follows:
LAMBDA(txt,tks, LET(seq, SEQUENCE(COLUMNS(tks),1),
out, REDUCE("", seq, LAMBDA(acc,idx, LET(str, IF(acc="", txt, acc),
SUBSTITUTE(str, INDEX(tks,1,idx), INDEX(tks,2,idx))))),
TEXTSPLIT(out,,";",TRUE)))
and the input tks (tokens) can be defined as follows: {"&","-";"1_", "0_"}, i.e. the first row holds the old values and the second row the new values.

Replace All and reduce text string

I have Google Forms results from live workshops that I want to edit in Sheets. I then want to calculate averages, pivot etc. to explore the data for insights.
I want to clean and standardize the data. I'd like to replace a series of long strings and reduce them just to their leading number:
UPDATE: as requested, here is a view-only sample Sheet with raw data on the 1st tab and how I'd like it to be formatted on the 2nd tab:
https://docs.google.com/spreadsheets/d/1pP8YV3oJXWGt3-88qgzMuup1pm1IsIgD4SHDL09MXrc/edit?usp=sharing
Example:
I'm aware of certifications I should be starting.
Replace with
3
Example 2:
I'm currently progressing a powerful certification.
Replace with:
4
In Excel I would simply use * as a wildcard for the rest of the string, but Sheets appears to be different. I've read documentation and posts about regular expressions etc., and I'm not sure if that's overkill or how to proceed.
I'd THEN like to create a macro which does that for the whole sheet:
All strings which begin with 1)*
--> Replace with 1
All strings which begin with 2)*
--> Replace with 2
try:
=REGEXEXTRACT(A1; "^\d+")
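If you want the whole column converted in one go (so no macro is needed), a hedged variant, assuming the raw strings sit in column A:

=ARRAYFORMULA(IFERROR(VALUE(REGEXEXTRACT(A1:A; "^\d+"))))

IFERROR blanks the rows without a leading number, and VALUE turns the extracted digits into real numbers so averages and pivots work.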

query parameter preserve as json

I'm trying to store API request query parameters in JSON format, in a way that preserves the inferred original types of the parameters' values. I do this without knowing what these APIs look like beforehand.
The code below deals with each query argument (delimited by &) one by one.
for (int i = 0; i < url_arg_cnt; i++) {
    const http_arg_t *arg = http_get_arg(http_info, i);
    if (cJSON_GetObjectItem(query, arg->name.p) == NULL) {
        // Currently just treating every value as a string.
        cJSON_AddItemToObject(query, arg->name.p, cJSON_CreateString(arg->value.p));
        SLOG_INFO("name:value is %s:%s\n", arg->name.p, arg->value.p);
    } else {
        // Duplicate key.
    }
}
With the above code, for input
?start=0&count=2&format=policyid|second&id%5Bkey1%5D=1&id[key2]=2&object=%7Bone:1,two:2%7D&nested[][foo]=1&nested[][bar]=2
I get these prints:
name:value is start:0
name:value is count:2
name:value is format:policyid|second
name:value is id[key1]:1
name:value is id[key2]:2
name:value is object:{one:1, two:2}
name:value is nested[][foo]:1
name:value is nested[][bar]:2
According to this document and other places I've researched (https://swagger.io/docs/specification/serialization/), there is no consensus on how the query parameters are passed, and therefore no guarantee of what I could encounter here. So my goal is to support as many variations as possible.
These possibilities seem to be the most common:
Arrays:
?x=1,2,3
?x=1&x=2&x=3
?x=1%202%203
?x=1|2|3
?x[]=1&x[]=2
String:
?x=1
Object, could be nested:
?x[key1]=1&x[key2]=2
?x=%7Bkey1:1,key2:2%7D
?x[][foo]=1&x[][bar]=2
?fields[articles]=title,body&fields[people]=name
?x[0][foo]=bar&x[1][bar]=baz
Any ideas how to best go about this? Basically, for these query parameters I want to aggregate 'exploded' arguments that belong together and save properly intended JSON objects to query. The line in question:
cJSON_AddItemToObject(query, arg->name.p, cJSON_CreateString(arg->value.p));
Converting the URI query to JSON
This post will provide a more generic (canonical) approach to the problem of extracting the variables from the URI string.
The query is defined across several descriptive standards (RFCs and specifications), so to have a canonical approach, we need to use the specifications to create a normalized form of the query before we can build the object.
TL;DR
To ensure that we can implement the specifications with the ability to cater for future extensions, the algorithm to convert the query to JSON should be separated into steps, each one gradually building the normalized form of the query, before it can be converted to a JSON object. To do so, we need the following steps:
Extract the query from the URI
Split to key=value
Normalize the key (build the object hierarchy)
Normalize the value (populate the object attributes and build the attribute arrays)
Build JSON object based on the normalized key=value
Such separation of the steps will allow much easier adoption of future changes in the specifications. The parsing of the values can be done with RegEx or with a parser (BNF, PEG, etc.). A sketch of how the steps could fit together is shown below.
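For orientation, here is a minimal C sketch of the pipeline; the helper names (extract_query, normalize_key, normalize_value, add_at_path) are my own assumptions and are each sketched in the corresponding sections below:

#include <string.h>
#include "cJSON.h"

int  extract_query(const char *uri, char *out, size_t out_len);
void normalize_key(const char *key, char *out);
void normalize_value(char *value);
void add_at_path(cJSON *root, const char *path, const char *value);

cJSON *query_to_json(const char *uri) {
    char qs[1024], key[256];
    cJSON *root = cJSON_CreateObject();
    if (extract_query(uri, qs, sizeof qs) != 0)        /* step 1 */
        return root;
    char *save = NULL;
    for (char *pair = strtok_r(qs, "&", &save);        /* step 2 */
         pair != NULL; pair = strtok_r(NULL, "&", &save)) {
        char *eq = strchr(pair, '=');
        char *val = eq ? eq + 1 : pair + strlen(pair); /* "" if no '=' */
        if (eq) *eq = '\0';
        normalize_key(pair, key);                      /* step 3 */
        normalize_value(val);                          /* step 4 */
        add_at_path(root, key, val);                   /* step 5 */
    }
    return root;
}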
Conversion steps
The first thing to be done is to extract the query string from the URI. This is described in RFC3986 and will be explained in its own section, Extracting the query string. The extraction of the query segment, as we will see later, can easily be done with RegEx.
After the query string is extracted from the URI, one needs to interpret the information conveyed by the query. As we will see below, the query has a very loose definition in RFC3986, and the case where the query conveys variables is further elaborated in RFC6570. During the extraction, the algorithm should extract the values (which are in the form key=value) and store them in a map structure (one approach would be to use a struct, as described in the following SO post). The section Extracting variables from the query string provides an overview of the process.
After the variables are separated and placed in the form key=value, the next stage is to normalize the key. Proper interpretation of the key will allow us to build the hierarchical structure of the JSON object from the key=value structure. RFC6570 does not provide much information on how the key can be normalized, but the OpenAPI specification gives good guidance on how to handle different types of keys. The normalization is further elaborated in the section Normalizing the key.
Next we need to normalize the values by continuing to build on RFC6570, which defines the types of the variables in several levels. This is further elaborated in the section Normalizing the value.
The final stage is to build the JSON object with cJSON_AddItemToObject(query, name, cJSON_CreateString(value));. More details are discussed in the section Building the JSON Object.
During implementation, some of the steps can be merged into a single step to optimize the implementation.
Extracting the query string
RFC3986, the main descriptive standard governing the URI, defines the URI as:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
The query part is defined in section 3.4 of the RFC as the following segment of the URI:
... The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI. ...
The formal syntax of the query segment is defined as:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
This means that the query can contain more instances of ? and / before the # is met. Actually, as long as the characters after the first occurrence of the ? are in the set of characters that have no special meaning, everything found until the first # is encountered is the query.
At the same time, this also implies that the sub-delimiter &, as well as the ?, has no special meaning according to this RFC when encountered inside the query string, as long as it is in the proper form and position in the URI. This means each implementation can define its own structure. The language of the RFC in section 3.4 confirms this implication by leaving space for other interpretations, using often instead of always:
... However, as query components
are often used to carry identifying information in the form of
"key=value" pairs ...
In addition, the RFC also provides the following RegEx that can be used to extract the query part from the URI:
regex   : ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
segments:  12            3  4          5       6  7        8 9
Where the capture #7 is the query from the URI.
The easiest approach for extracting the query, provided that we are not interested in the remaining parts of the URI, is to use this RegEx to split the URI and extract the query string, which will contain neither the leading ? nor the terminating #.
RFC3986 is further extended by RFC3987 to cover international characters; however, the RegEx defined by RFC3986 remains valid.
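Provided the platform has POSIX regex.h, a sketch of this extraction (the function name is my own):

#include <regex.h>
#include <string.h>

/* Copy the query component of uri (capture group 7 of the RFC 3986
   regex, i.e. without the leading '?') into out; "" if there is none. */
int extract_query(const char *uri, char *out, size_t out_len) {
    static const char *pat =
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";
    regex_t re;
    regmatch_t m[10];
    out[0] = '\0';
    if (regcomp(&re, pat, REG_EXTENDED) != 0)
        return -1;
    int rc = regexec(&re, uri, 10, m, 0);
    if (rc == 0 && m[7].rm_so != -1) {
        size_t len = (size_t)(m[7].rm_eo - m[7].rm_so);
        if (len >= out_len)
            len = out_len - 1;
        memcpy(out, uri + m[7].rm_so, len);
        out[len] = '\0';
    }
    regfree(&re);
    return rc == 0 ? 0 : -1;
}

/* extract_query("http://h/p?a=1&b=2#f", buf, sizeof buf) -> "a=1&b=2" */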
Extracting variables from the query string
To decompose the query string into key=value pairs, we need to reverse-engineer RFC6570, which establishes a descriptive standard for the expansion of variables and the construction of a valid query. As the RFC states:
... A URI Template provides both a structural description of a URI space
and, when variable values are provided, machine-readable instructions
on how to construct a URI corresponding to those values. ...
From the RFC, we can extract the following syntax for a variable in the query:
query = variable *( "&" variable )
variable = varname "=" varvalue
varvalue = *( valchar / "[" / "]" / "{" / "}" / "?" )
varname = varchar *( ["."] varchar )
varchar = ALPHA / DIGIT / "_" / pct-encoded
pct-encoded = "%" HEXDIG HEXDIG
valchar = unreserved / pct-encoded / vsub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
vsub-delims = "!" / "$" / "'" / "(" / ")"
/ "*" / "+" / ","
The extraction can be performed with a parser that implements the above grammar, or by iterating over the query with the following RegEx and extracting the (key, value) pairs:
([\&](([^\&]*)\=([^\&]*)))
In case we use RegEx, note that in the previous section we already omitted the ? at the start of the query and the # at the end, so we don't need to handle these characters when separating the variables.
Normalizing the key
The descriptive standard RFC6570 provides generic rules for the format of the key, but the RFC does not help much when it comes to rules for interpreting the key when an object is constructed. Some specifications, such as the OpenAPI specification, the JSON API specification, etc., can help with the interpretation, but they provide only a subset of the rules rather than the full set. To make things worse, some SDKs (e.g. the PHP SDK) have their own rules for building the keys.
In such a situation, the best approach is to create hierarchical rules for key normalization that convert the key to a unified format, similar to JSON path dot notation. The hierarchical rules will allow us to control how ambiguous situations (collisions between specifications) are resolved, by controlling the order of the rules. The JSON path notation will allow us to build the object in the final step without requiring the key=value pairs to arrive in any particular order.
Following is the grammar of the normalized format:
key = sub-key *("." sub-key )
sub-key = name [ ("[" index "]") ]
name = *( varchar )
index = DIGIT *( DIGIT )
This grammar will allow for keys such as foo, foo.baz, foo[0].baz, foo.baz[0], foo.bar.baz etc.
The following rules and transformations are a good starting point:
Flat key (key -> key)
Attribute key (key.atr -> key.atr)
Array key (key[] -> key[0])
Object Array key (key[attribute] -> key.attribute), (key[][attribute] -> key[0].attribute), (key[attribute][] -> key.attribute[0])
More rules can be added to address special cases. During the transformation, the algorithm should pass from the most specific rules (the bottom rules) to the most generic rules and try to find a full match. If a full match is found, the key is rewritten in the normal form and the remaining rules are skipped. A sketch of such a pass is shown below.
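A minimal sketch of one normalization pass over a single key, covering the rules above ("a[b]" -> "a.b", "a[]" -> "a[0]", numeric "[n]" kept as-is); it assumes out is large enough, and numbering unnamed [] across repeated keys would need state carried between calls:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

void normalize_key(const char *key, char *out) {
    int next_index = 0;
    while (*key) {
        if (*key == '[') {
            const char *close = strchr(key, ']');
            if (!close) { *out++ = *key++; continue; }   /* malformed: copy */
            if (close == key + 1) {                      /* "[]" -> "[n]" */
                out += sprintf(out, "[%d]", next_index++);
            } else if (isdigit((unsigned char)key[1])) { /* "[0]" kept */
                memcpy(out, key, (size_t)(close - key + 1));
                out += close - key + 1;
            } else {                                     /* "[attr]" -> ".attr" */
                *out++ = '.';
                memcpy(out, key + 1, (size_t)(close - key - 1));
                out += close - key - 1;
            }
            key = close + 1;
        } else {
            *out++ = *key++;
        }
    }
    *out = '\0';
}

/* normalize_key("x[key1]", buf)  -> "x.key1"
   normalize_key("x[][foo]", buf) -> "x[0].foo" */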
Normalizing the value
Similar to the normalization of the key, the value should also be normalized in cases where the value represents a list. We will need to convert the value from an arbitrary list format to the form style (comma-separated list), which is defined by the following grammar:
value = single-value *( "," single-value )
single-value = *( unreserved / pct-encoded )
This grammar allows the value to take the forms a, a,b, a,b,c, etc.
Extracting the list of values from the value string can be done by splitting the string on the valid delimiters (",", ";", "|", etc.) and producing the list in the normalized form, as in the sketch below.
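A naive sketch of that normalization, rewriting the common delimiters in place; it assumes percent-decoding has already happened and that these characters never occur inside a single value:

void normalize_value(char *value) {
    /* "1|2|3" -> "1,2,3", "1 2 3" -> "1,2,3", "1;2;3" -> "1,2,3" */
    for (char *p = value; *p != '\0'; p++)
        if (*p == '|' || *p == ';' || *p == ' ')
            *p = ',';
}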
Building the JSON Object
Once the keys and the values are normalized, converting the flat list (the map structure) to a JSON object can be done in a single pass through all of the keys in the list. The normalized format of the key helps us here: the key conveys the whole information about its hierarchy in the object, so even if we have not encountered some of the intermediate attributes, we are able to build the object.
Similarly, we can recognize from the value itself whether the attribute should be a flat string or an array, so here as well no additional information is required to create the proper representation.
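A sketch of that single pass for dot-notation keys, using cJSON to create the intermediate objects on the way down; handling of "[n]" array segments and comma-list values (cJSON_CreateArray) is omitted for brevity:

#include <stdio.h>
#include <string.h>
#include "cJSON.h"

void add_at_path(cJSON *root, const char *path, const char *value) {
    char buf[256];
    snprintf(buf, sizeof buf, "%s", path);
    cJSON *node = root;
    char *save = NULL;
    char *tok = strtok_r(buf, ".", &save);
    while (tok != NULL) {
        char *next = strtok_r(NULL, ".", &save);
        if (next == NULL) {                      /* leaf: set the value */
            cJSON_AddItemToObject(node, tok, cJSON_CreateString(value));
        } else {                                 /* intermediate object */
            cJSON *child = cJSON_GetObjectItem(node, tok);
            if (child == NULL) {
                child = cJSON_CreateObject();
                cJSON_AddItemToObject(node, tok, child);
            }
            node = child;
        }
        tok = next;
    }
}

/* add_at_path(query, "id.key1", "1") yields {"id":{"key1":"1"}} */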
Alternative approach
As an alternative approach, we can construct a full grammar that creates an AST (abstract syntax tree) and use the tree to produce the JSON object; however, due to the multiple variations of the formats and the need to allow future extensions, this approach is less flexible.
Useful links
The grammar in the text follows the ABNF grammar rules
JSON Path
GNU Bison is an example of a BNF parser
C PEG parser library is an example of a PEG parser
I recently ran into the same issue and will share some wisdom gained from the episode.
I'm assuming you are implementing this on a MITM device (web firewall, etc.).
As noted in the question, there is no consensus on how the query parameters are passed: no single standard or set of rules governs this -- in fact, any server may implement its own syntax, as long as that syntax is supported by the server code. The best one can do is to 1) decide which query parameter forms to support (do the best you can, maybe as many as possible) and 2) support only those forms, treating the rest (the ones not supported) as String values, like your current code does.
It's not worth fretting too much about the accuracy of the type preservation/inference in question, or formalizing/generalizing it into a heavyweight solution, because 1) the syntax you may encounter is arbitrary (not necessarily conforming to any standard -- web servers can really do whatever they want, so query parameters often don't conform to, say, the Swagger standard referenced) and 2) looking at the query parameters only gives you so much information -- the benefit of implementing anything more than vague approximations (per rules defined by yourself, as stated before) is hard to see. Think about how vague even the simplest of cases can be: in the exploded case x=something&x=something, you sort of have to pretend arrays have at least two elements; with only one element -- x=something -- you treat it as a string, for how else would you know whether it's an array or a string? How about the x=1 case: is 1 a string or a number, per the original/intended type? And what about x=foo&y=1 | 2 | 3, or "1, 2, 3" with spaces? Are the spaces supposed to be ignored, are they array delimiters themselves, or are they actually part of the array elements? Finally, how do you even know the intended string is not "1 | 2 | 3" itself, meaning it's not an array at all?
So the best one can do in parsing these strings and trying to support/infer all these variations (different rules) is to define one's own rules (whatever one is okay/happy with) and support only those.

SQL Server validating postcodes

I have a table containing postcodes, but there is no validation built into the entry form, so there is no consistency in the way they are stored in the database. Sample below:
ID Postcode
001742 B5
001745
001746
001748 DY3
001750
001751
001768 B276LL
001774 B339HY
001776 B339QY
001780 WR51DD
I want to use these postcodes to map the distance from a central point, but before I can do that I need to put them into a valid format and filter out any blanks or incomplete postcodes.
I had considered using
left(postcode,3) + ' ' + right(postcode,3)
to correct the formatting, but this wouldn't work for postcodes like 'M6 8HD'.
My aim is to get the list of postcodes into a valid format, but I don't know how to account for the different lengths of postcode. Is there a way to do this in SQL Server?
As discussed in the comments, sometimes looking at a problem the other way around presents a far simpler solution.
You have a list of arbitrary input provided by users, which frequently doesn't contain the correct spacing. You also have a list of valid postcodes which are correctly spaced.
You're trying to solve the problem of finding the correct place to insert spaces into your arbitrary inputs to make them match the list of valid codes, and this is extremely difficult to do in practice.
However, performing the opposite task - removing the spaces from the valid postcodes - is remarkably easy to do. So that is what I'd suggest doing.
In our most recent round of data modelling, we modelled addresses with two postcode columns: PostCode, containing the postcode as provided from whatever source, and PostCodeNoSpace, a computed column which strips whitespace characters from PostCode. We use the latter column for e.g. searches based on user input. You may want to do something similar with your list of valid postcodes, if you're keeping it around permanently, so that you can perform easy matches/lookups and then translate those matches back into a version that has spaces - which is actually a solution to the original question posed! A sketch of this is shown below.
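A minimal T-SQL sketch of that model (table and column names are assumptions):

-- Computed column on the list of valid, correctly spaced postcodes:
ALTER TABLE ValidPostcodes
    ADD PostCodeNoSpace AS REPLACE(PostCode, ' ', '') PERSISTED;

-- Recover the correctly spaced form for each raw, unvalidated entry:
SELECT t.ID, t.Postcode AS RawInput, v.PostCode AS CleanPostcode
FROM   YourTable AS t
JOIN   ValidPostcodes AS v
  ON   v.PostCodeNoSpace = REPLACE(t.Postcode, ' ', '');

Rows that don't join are the blanks and incomplete postcodes you wanted to filter out, and since REPLACE is deterministic, the computed column can be PERSISTED and indexed to speed up the lookups.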

select first two characters of values in a concatenated string

I am trying to create a formula field that checks a string consisting of concatenated values separated by commas. I want to check the first two characters of each comma-separated value in the string. For example, the string pattern could be: abcd,efgh,ijkl,mnop,qrst,uvwx
In my formula I'd like to check whether the first two characters of each value are 'ab' or 'ef'.
If so, I would return true, else false.
Thanks.
To do this properly, you need to use a regular expression. Unfortunately the REGEX function is not available in formula fields. It is, however, available in formulas in Validation Rules and in Workflow Rules. You can, therefore, specify the below formula in either a Validation Rule or a Workflow Rule:
OR(
AND(
NOT(
BEGINS( KXENDev__Languages__c, "ab" )
),
NOT(
BEGINS( KXENDev__Languages__c, "ef" )
)
),
REGEX( KXENDev__Languages__c , ".*,(?!ab|ef).*")
)
If it's a Validation Rule, you're done -- this formula will create an error if any of the entries do not start with "ab" or "ef". If it's a Workflow Rule, then you can add a Field Update to it to update some field with False when this formula is true (if this formula is true then there is at least one item that doesn't start with ab or ef, so that would make your field False).
Some may ask "What's with the BEGINS statements? Couldn't you have done this all with one REGEX?" Yes, I probably could, but that makes for an increasingly complex REGEX statement, and these are quite difficult to debug in Salesforce.com, so I prefer to keep my REGEXes in Salesforce.com as simple as possible.
I suggest you search for ',ab' and ',ef' using the CONTAINS method. But first of all you need to re-implement the method which composes this string so that it puts ',' before the first substring. In the end the returned string should look like ',abcd,efgh,ijkl,mnop,qrst,uvwx'.
If you are not able to re-implement the method which composes this string, use LEFT([your string goes here], 2) to check the first two chars.
