scala readCsvFile behaviour - apache-flink

I´m using flink to load a csv file to a dataset of pojos, defined through a scala case class, using readCsvFile method, and I have a problem that i cannot solve.
When in the csv there is a record with some format error in any of its fields it is discarded, and I assume that the only way to keep those records is to type them all as String and do the validations myself.
The problem is that if the last field after the delimiter is empty, the record is discarded by default, I think because it is considered as not having the expected number of fields it should, and it is not possible to handle this record error, while if the empty value if in any of the previous fields there is no problem.
Example
field1|field2|field3
a||c
a|b|
In this example, first record is returned by readCsvFile method but not the second.
Is this behaviour right? and there is any walk around to get the record?
Thanks

Case classes and tuples in Flink do not support null values. Therefore, a||c is invalid if the empty field is not a String. I recommend to use the RowCsvInputFormat in this case. It supports nulls and generic rows can be converted to any other class in a following map operator.

The problem is that, as you say, if the field is a String, the record should be valid even if it is null, and this doesn't happen when the null value is in the last field. The behaviour is different depending on the position.
I will try also with RowCsvInputFormat as you recommend.
Thanks

Related

Snowflake:Export data in multiple delimiter format

Requirement:
Need the file to be exported as below format, where gender, age, and interest are columns and value after : is data for that column. Can this be achieved while using Snowflake, if not is it possible to export data using Python
User1234^gender:male;age:18-24;interest:fishing
User2345^gender:female
User3456^age:35-44
User4567^gender:male;interest:fishing,boating
EDIT 1: Solution as given by #demircioglu
It displays as NULL values instead of other column values
Below the EMPLOYEES table data
When I ran below query
SELECT 'EMP_ID'||EMP_ID||'^'||'FIRST_NAME'||':'||FIRST_NAME||';'||'LAST_NAME'||':'||LAST_NAME FROM tempdw.EMPLOYEES ;
Create your SQL with the desired format and write it to a file
COPY INTO #~/stage_data
FROM
(
SELECT 'User'||User||'^'||'gender'||':'||gender||';'||'age'||':'||age||';'||'interest'||':'||interest FROM table
)
file_format = (TYPE=CSV compression='gzip')
File format here is not important because each line will be treated as a field because of your delimiter requirements
Edit:
CONCAT function (aliased with ||) returns NULL if you have a NULL value.
In order to eliminate NULLs you can use NVL2 function
So your SQL will have series of NVL2s
NVL2 checks the first parameter and if it's not NULL returns first expression, if it's NULL returns second expression
So for User column
'User'||User||'^' will turn into
NVL2(User,'User','')||NVL2(User,User,'')||NVL2(User,'^','')
P.S. I am leaving up to you to create the rest of the SQL, because Stackoverflow's function is to help find the solution, not spoon feed the solution.
No, I do not believe multiple delimiters like this are supported in Snowflake at this time. Multiple byte and multiple character delimiters are supported, but they will need to be specified as the same delimiter repeated for either record or line.
Yes, it may be possible to do some post-processing or use Python scripts to achieve this. Or even SQL transformative statements. This is not really my area of expertise so if someone has an example for you, I'll let them add to the discussion.

strdist function stopped recognizing field reference in SOLR 5

I remember in previous SOLR version (4.x) I was able to run the following query:
"(lastName:HILL)"
with fields
"*,score, strDistLastName:$lnamestrdist"
and with raw query params
"lnamestrdist=strdist('HILL',lastName,jw)"
This would give me result with an additional field which is result of strdist function using Jaro Winkler algorithm between a value and returned field.
For some reason in SOLR version 5.1 it always returns 0, even if strings match 1 to 1 (i.e. strdist should be 1).
I have checked it without using variables, i.e. only specifying fields as
"*,score, strdist('HILL',lastName,jw)"
but it also returns 0.
And only when I use another string literal like below, it returns 1:
"*,score, strdist('HILL','HILL',jw)"
I assume it means that strdist does not recognize fields anymore. Does anyone know why? Maybe syntax has changed or it's simply a bug?
Thank you very much in advance!
My bad, I have "solr.LowerCaseFilterFactory" filter in index/query analyzers for the "lastNameExact" field, therefore value for strdist for 'HILL' and 'hill' was 0. After I corrected fields list to "*,score, strdist('hill',lastName,jw)", strdist became 1.
Sorry for the confusion :)

SSIS Derived Column - Text in Numeric Field is not converting

I'm importing thousands of csv files into an SQL DB. They each have two columns: Date and Value. In some of the files, the value column contains simply a period (ex: "."). I've tried to create a derived column that will handle any cell that contains a period with the following code:
FINDSTRING((DT_WSTR,1)[VALUE],".",1) != 0 ? NULL(DT_R8) : [VALUE]
But, when the package runs it gets the following error when it reaches the cell with the period in it:
The data conversion for column "VALUE" returned status value 2 and status text
"The value could not be converted because of a potential loss of data".
I'm guessing there might be an escape character that I'm missing in my FINDSTRING function but I can't seem to find what that may be. Does anyone have any thoughts on how I can get around this issue?
Trying to debug things like this is why I always advocate adding many Derived Columns to the Data Flow. It's impossible to debug that entire expression. Instead, first find the position of the period and add that as a new column. Then you can feed that into the ternary operation and bit by bit you can add data viewers to ensure you are seeing what you expect to see.
Personally, I'd take a different approach. It seems that you'd like to make any columns that are . into a null of type DT_R8.
Add a derived column, TrimmedValue and use this expression to remove any leading/trailing whitespace and then
RTRIM(LTRIM(Value))
Add a second derived column component, this time we'll add column MenopausalValue as it will remove the period. Use this expression
(TrimmmedValue == ".") ? Trimmedvalue : NULL(DT_WSTR, 50)
Now, you can add your final Derived Column wherein we convert the string representation of Value to the floating point representation.
IsNull(MenopausalValue) ? NULL(DT_R8) : (DT_R8) MenopausalValue
If the above shows an error, then you need to apply the following version as I can never remember the evaluation sequence for ternary operations that change type.
(DT_R8) (IsNull(MenopausalValue) ? NULL(DT_R8) : (DT_R8) MenopausalValue)
Examples of breaking these operations into many steps for debugging purposes
https://stackoverflow.com/a/15176398/181965
https://stackoverflow.com/a/31123797/181965
https://stackoverflow.com/a/33023858/181965
You can do it like this:
TRIM(Value) == "." ? NULL(DT_R8) : (DT_R8)Value

Using text in a column of Date/Time type in access

I have a column in MS Access in which the data could be any of the following:
A date
Text string: "n/a"
Text string: "n/e"
The vast majority of entries will be dates but a very few will need to be these specified text strings. I would like to still be able to perform date calculations on the column. Whats the best datatype to use?
In my opinion the best approach would be to leave the date field as Date/Time and then add another field to indicate the status if the Date/Time field is Null. Something like:
DateField DateStatus
--------- ----------
2014-09-21
n/a
2014-09-23
2014-09-25
n/e
You could use a single Text field, but then any time you wanted to use the field value as a proper Date/Time value you'd have to convert it using CDate(). You would also have the possibility of other junk getting in there, or dates getting entered in different formats (e.g. d/m/yyyy vs. m/d/yyyy). And finally, you would lose the ability to easily determine whether a Date/Time value is in a particular row (which in my approach would simply be ... WHERE DateField IS [NOT] NULL).
I agree with Gord Thompson's answer - mainly because it's so non-intuitive to have, essentially, two completely different types of data in a single column, and because it's going to make validation/data integrity stuff so much harder with little upside - and, as he indicates with the CDate() reference, dates basically only work reliably like dates if they're in a "date/time" field. Microsoft has a page on choosing a data type that explains some of the Access-specific differences in more detail.
I also suggest that you don't actually have a text field for those "comments," since you say there's only a handful of potential options - use a Long Integer and connect back to a separate table with the list of allowable entries. This will allow you to run reports more easily, change the "display text" in one step instead of potentially dozens of times, etc. It also saves a relatively small amount of space per record (long integer = 4 bytes; text = up to 255 bytes.)
You can also do fun data/reporting stuff with that Comment (long integer) field and dates - even combined into ranges, by the way - queries let you use the two different columns to create a single answer. I have a report that's grouped so that you can see stats for everything that's active (by quarter in which they start) plus everything that's pending (with the code indicating who's responsible for watching this record,) plus everything that's not pending but still doesn't have a start date (with the reason code displayed,) plus everything that's expired (by quarter in which they ended.) It looks like each of those things is in a single column in the report, but it's actually like five columns that have been concatenated with the IIf function.
(Almost every argument I can come up with boils down to "this is what relational databases are all about and why they're so awesome.)

NULL vs Empty when dealing with user input

Yes, another NULL vs empty string question.
I agree with the idea that NULL means not set, while empty string means "a value that is empty". Here's my problem: If the default value for a column is NULL, how do I allow the user to enter that NULL.
Let's say a new user is created on a system. There is a first and last name field; last name is required while first name is not. When creating the user, the person will see 2 text inputs, one for first and one for last. The person chooses to only enter the last name. The first name is technically not set. During the insert I check the length of each field, setting all fields that are empty to NULL.
When looking at the database, I see that the first name is not set. The question that immediately comes to mind is that maybe they never saw the first name field (ie, because of an error). But this is not the case; they left if blank.
So, my question is, how do you decide when a field should be set to NULL or an empty string when receiving user input? How do you know that the user wants the field to be not set without detecting focus or if they deleted a value...or...or...?
Related Question: Should I use NULL or an empty string to represent no data in table column?
I'll break the pattern, and say that I would always use NULL for zero-length strings, for the following reasons.
If you start fine-slicing the implications of blanks, then you must ensure somehow that every other developer reads and writes it the same way.
How do you alphabetize it?
Can you unambiguously determine when a user omitted entering a value, compared with intentionally leaving it blank?
How would you unambiguously query for the difference? Can a query screen user indicate NULL vs. blank using standard input form syntax?
In practice, I've never been prohibited from reading and writing data using default, unsurprising behavior using this rule. If I've needed to know the difference, I've used a boolean field (which is easier to map to unambiguous UI devices). In one case, I used a trigger to enforce True => null value, but never saw it invoked because the BR layer filtered out the condition effectively.
If the user provides an empty string, I always treat it as null from the database perspective. In addition, I will typically trim my string inputs to remove leading/trailing spaces and then check for empty. It's a small win in the database with varchar() types and it also reduces the cases for search since I only need to check for name is null instead of name is null or name = '' You could also go the other way, converting null to ''. Either way, choose a way and be consistent.
What you need to do is figure out what behavior you want. There is no one fixed algebra of how name strings are interpreted.
Think about the state machine here: you have fields which have several states: it slounds like you're thinking about a state of "unitialized", another of "purposefully empty" and a third with some set value. ANYTHING you do that makes that assignment and is consistent with the rest of your program will be find; it sounds like the easy mapping is
NULL → uninitialized
"" → purposefully unset
a name → initialized.
I almost never use NULL when referring to actual data. When used for foreign keys, I would say that NULL is valid, but it is almost never valid for user entered data. The one exception that would probably come up quite regularly is for dates that don't exist, such as an employee database with a "termination_date" field. In that case, all current employees should have a value of NULL in that field. As for getting them to actually enter a null value, for the values that truly require a null value, I would put a checkbox next to the input field so that the user can check it on and off to see the corresponding value to null (or in a more user friendly manner, none). When enabling the checkbox to set the field to null, the corresponding text box should be disabled, and if a null value is already associated, it should start out as disabled, and only become enabled once the user unchecks the null checkbox.
I try to keep things simple. In this case, I'd make the first-name column not-nullable and allow blanks. Otherwise, you'll have three cases to deal with anywhere you refer to this field:
Blank first name
Null first name
Non-blank first name
If you go with 'blank is null' or 'null is blank' then you're down to two cases. Two cases are better than three.
To further answer your question: the user entering data probably doesn't (and shouldn't) know anything about what a "null" is and how it compares to an "empty". This issue should be resolved cleanly and consistently in the system--not the UI.
While your example is mostly for strings, I like to say that I use null for numerical and boolean fields.
An accountbalance of 0 is very different to me as one that is null. Same for booleans, if people take a multiple choice test with true and false answers, it is very important to know whether someone answered true or false or didn't answer at all.
Not using null in these cases would require me to have an extra table or a different setup to see whether someone answered a question. You could use for instance -1 for not filled in 0 for false and 1 for true, but then you're using a numerical field for something that's essentially a boolean.
I've never, ever had a use for a NULL value in production code. An empty string is a fine sentinel value for a blank name field, phone number, or annual income for any application. That said, I'm sure you could find some use for it, but I just think it's overused. If I were to use a NULL value, however, I imagine I'd use it anywhere I want to represent an empty value.
I have always used NULL for uninitialized values, empty for purposely empty values and 0 for off indicators.
By doing this all the time, it is there even if I am not using it, but I don't have to do anything different if I need that distinction.
I am usually testing for empty(), but sometimes I check for isset() which evaluates false on NULL. This is useful for reminders to answer certain questions. If it is empty, false or 0 then the question is answered.

Resources