import complex data structure in hive with custom separators - arrays

I have a huge dataset with the following structure
fieldA,fieldB,fieldC;fieldD|fieldE,FieldF;fieldG|fieldH,FieldI ...
where:
fieldA, fieldB, and fieldC are strings that should be imported into separate columns
fieldD|fieldE,FieldF;fieldG|fieldH,FieldI is an array (elements separated by semicolons) of maps (key separated from value by |) whose values are arrays (elements separated by commas, e.g. fieldE,FieldF)
My problem is that the initial array is separated from fieldA,fieldB,fieldC by a semicolon, the same character that separates the array's own elements. My question is how to set the separators correctly when I create the table.
This one does not recognize the array, although I provide a semicolon as the field separator:
CREATE TABLE string_array(
    first_part STRING -- this would be to store fieldA,fieldB,fieldC
    ,second_part ARRAY<STRING> -- this would be to store fieldD|fieldE,FieldF;fieldG|fieldH,FieldI and split it by semicolon
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u003b'
COLLECTION ITEMS TERMINATED BY '\u003b'
MAP KEYS TERMINATED BY '|'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '...' INTO TABLE string_array;
Any ideas how to make it work so I can build upon it? Thanks a lot in advance!

Great question.
I think that we can break this problem up into two discrete pieces: (1) Hive table structure, and (2) data delimiters.
Let's start by looking at the Hive table structure. If I understood your data structure correctly (please correct me if I didn't), the table structure that would best describe your data could be represented as:
CREATE TABLE string_array
AS
SELECT 'fieldA,fieldB,fieldC' AS first_part, array(map('fieldD', array('fieldE', 'FieldF')), map('fieldG', array('fieldH','FieldI'))) AS second_part;
Note that the field second_part is an array of maps, where the key to each map references an array of strings. In other words, the field second_part consists of an array within a map within an array.
If I use the statement above to create a table, I can then copy the resulting table to the local filesystem and look at how Hive assigns default delimiters to it. I know that you don't want to use default delimiters, but please bear with me here. The resulting table looks like this in its serialized on-disk representation:
00000000 66 69 65 6c 64 41 2c 66 69 65 6c 64 42 2c 66 69 |fieldA,fieldB,fi|
00000010 65 6c 64 43 01 66 69 65 6c 64 44 04 66 69 65 6c |eldC.fieldD.fiel|
00000020 64 45 05 46 69 65 6c 64 46 02 66 69 65 6c 64 47 |dE.FieldF.fieldG|
00000030 04 66 69 65 6c 64 48 05 46 69 65 6c 64 49 0a |.fieldH.FieldI.|
If we look at how Hive sees the delimiters, we note that Hive actually uses five levels of delimiters:
delimiter 1 = x'01' (between fieldC & fieldD) -- between first_part and second_part
delimiter 2 = x'02' (between fieldF & fieldG) -- between the two maps in the array of maps
delimiter 3 = x'03' not used
delimiter 4 = x'04' (between fieldD & fieldE) -- between the key and the array of fields within the map
delimiter 5 = x'05' (between fieldE & fieldF) -- between the fields within the array within the map
And herein lies your problem. Current versions of Hive (as of 0.11.0) only allow you to override three levels of delimiters. But due to the levels of nesting within your data, Hive requires more than three.
My suggestion would be to pre-process your data to use Hive's default delimiters. With this approach you should be able to load your data into Hive and reference it.
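For example, here is a minimal preprocessing sketch in Python. It assumes every input line looks exactly like the sample in the question, and the file names are placeholders; the control characters it writes match the default delimiters observed in the dump above.

# Rewrite each line to use Hive's default delimiters:
# \x01 between columns, \x02 between array elements,
# \x04 between a map key and its value, \x05 between inner array items
with open('input.txt') as src, open('preprocessed.txt', 'w') as dst:
    for line in src:
        parts = line.rstrip('\n').split(';')
        first_part = parts[0]        # fieldA,fieldB,fieldC stays a single string
        maps = parts[1:]             # each element looks like fieldD|fieldE,FieldF
        encoded = []
        for m in maps:
            key, values = m.split('|', 1)
            encoded.append(key + '\x04' + values.replace(',', '\x05'))
        dst.write(first_part + '\x01' + '\x02'.join(encoded) + '\n')

With the data in that form, you can declare second_part as ARRAY<MAP<STRING,ARRAY<STRING>>>, omit the ROW FORMAT clause entirely, and let Hive apply its defaults.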


Possible (invisible) characters in text pasted from Excel or XML to a TSQL query

My ASP.Net application receives an uploaded Excel file, then converts it to XML to pass as a stored procedure parameter for processing. This file contains lots of entries that are text-matched to find the corresponding value.
Today (after working perfectly for years) three seemingly identical values caused mixed results on the text match. I can only illustrate with an example:
- The first value is the original value within Excel (copied from the spreadsheet).
- The second value is what the web application saves to XML, and is the raw value taken from the profiler showing the parameter values.
- The third value is what currently exists in the table.
The code is pasted from SSMS:
SELECT SkillGroupTitle FROM tbl_SkillGroups WHERE SkillGroupTitle = N'Quality Assurance Instruction (QAI)'; -- pasted from Excel
SELECT SkillGroupTitle FROM tbl_SkillGroups WHERE SkillGroupTitle = N'Quality Assurance Instruction (QAI)'; -- pasted from xml
SELECT SkillGroupTitle FROM tbl_SkillGroups WHERE SkillGroupTitle = N'Quality Assurance Instruction (QAI)'; -- pasted from SQL table (existing value)
Can somebody please advise what is happening here? All values appear identical visually, but there's clearly something different within the inbound values from Excel.
Update
Pasting the two values into a hex converter shows there are indeed three differences.
Excel data:
51 75 61 6c 69 74 79 a0 41 73 73 75 72 61 6e 63 65 a0 49 6e 73 74 72 75 63 74 69 6f 6e 20 28 51 41 49 29 0a
-- -- --
SQL data:
51 75 61 6c 69 74 79 20 41 73 73 75 72 61 6e 63 65 20 49 6e 73 74 72 75 63 74 69 6f 6e 20 28 51 41 49 29
Can anyone shed any light here please?
Updated
First, to identify hidden characters that cause strings not to be equal, you can leverage ngrams8k like this:
WITH tbl_SkillGroups(dSource, SkillGroupTitle) AS
(
    SELECT 'Excel', N'Quality Assurance Instruction (QAI)' -- pasted from Excel
    UNION ALL
    SELECT 'XML', N'Quality Assurance Instruction (QAI)'+CHAR(10) -- pasted from xml
    UNION ALL
    SELECT 'SQL', N'Quality Assurance Instruction (QAI)'+CHAR(13) -- pasted from SQL table (existing value)
)
SELECT
    [Source]   = ts.dSource,
    Position   = ng.Position,
    Token      = ng.Token,
    asciiValue = ASCII(ng.Token)
FROM tbl_SkillGroups AS ts
CROSS APPLY samd.ngrams8k(ts.SkillGroupTitle, 1) AS ng
--WHERE ng.Position > 32 -- Zoom in on the last few characters
Returns:
Source Position Token asciiValue
------ --------- ------- -----------
Excel 33 A 65
Excel 34 I 73
Excel 35 ) 41 -- Only 35 characters
XML 33 A 65
XML 34 I 73
XML 35 ) 41
XML 36 10 -- 36th character is a CHAR(10) (Looks like a space)
SQL 33 A 65
SQL 34 I 73
SQL 35 ) 41
SQL 36 13 -- 36th character is a CHAR(13) (Also looks like a space)
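As an aside, the hex dumps shown in the question can be reproduced outside SQL Server with a short Python snippet (the string literals below are hand-typed to match the bytes shown, and the separator argument to hexlify needs Python 3.8+):

import binascii

excel = 'Quality\xa0Assurance\xa0Instruction (QAI)\n'   # \xa0 = non-breaking space
sql = 'Quality Assurance Instruction (QAI)'

# latin-1 maps each of these characters to a single byte,
# so the output lines up with the dumps in the question
print(binascii.hexlify(excel.encode('latin-1'), ' ').decode())
print(binascii.hexlify(sql.encode('latin-1'), ' ').decode())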
Next, to clean hidden characters from your inputs, you can use PatReplace8k.
WITH tbl_SkillGroups(dSource, SkillGroupTitle) AS
(
    SELECT 'Excel', N'Quality Assurance Instruction (QAI)' -- pasted from Excel
    UNION ALL
    SELECT 'XML', N'Quality Assurance Instruction (QAI)'+CHAR(10) -- pasted from xml
    UNION ALL
    SELECT 'SQL', N'Quality Assurance Instruction (QAI)'+CHAR(13) -- pasted from SQL table (existing value)
)
SELECT SkillGroupTitle = pr.NewString
FROM tbl_SkillGroups AS ts
CROSS APPLY PatReplace8K(ts.SkillGroupTitle, '[^a-zA-Z ()]', '') AS pr
Returns:
SkillGroupTitle
------------------------------------
Quality Assurance Instruction (QAI)
Quality Assurance Instruction (QAI)
Quality Assurance Instruction (QAI)
Here, PatReplace8k removes any character that does not match the pattern (letters, spaces, and parentheses), thus making these three values equal.
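If you ever need the same normalization outside SQL Server, for example while preparing the upload, the equivalent is a plain regex replace. A minimal Python sketch mirroring the pattern above:

import re

def clean_title(value):
    # Keep only letters, spaces and parentheses, mirroring the
    # '[^a-zA-Z ()]' pattern passed to PatReplace8K above
    return re.sub(r'[^a-zA-Z ()]', '', value)

print(clean_title('Quality Assurance Instruction (QAI)\n'))  # trailing LF removed

Note that a pattern this strict deletes non-breaking spaces outright rather than turning them into ordinary spaces, so for the Excel input you may want to replace '\xa0' with ' ' first.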

Old school "Commodore 64" BASIC - Peek/Poke commands; is there an equivalent in BATCH form?

I'm an 'Old Timer' that learned to program on a Commodore 64 with a cassette drive (not a disk drive) for storing data. Oh the joy!
I am wondering if there is an equivalent way to perform PEEK and POKE commands in a .bat file. Is it even possible anymore to check a specific memory address the way it worked in BASIC?
Can a batch file locate the address of something like whether or not the 'y' key has been pressed, and can it also set the value of that address to indicate that the key was pressed?
It used to be something like PEEK(64324) would return the value at that location. Likewise, POKE 64324,1 would set the value at that location.
I could run a loop that basically waited for keyboard input, and if it received the correct trigger at that address it would perform a command, e.g.
For x = 1 to 1000
If PEEK(64324) = 1 then exit
Next x
So when the 'y' key was pressed, the loop would exit or go to the next command. Can BATCH check a specific address for its current state, and if so, is there any repository or listing somewhere that tells which address is what for things like colors and keys on the keyboard?
In MSDOS you can use the DEBUG tool to get a dump of memory:
SHOWBIOS.BAT
ECHO:d FE00:0000 0040 >debug.txt
ECHO:q >>debug.txt
DEBUG < debug.txt > debug.out
You can then post-process the memory dump with a script. The debug output looks like this:
-d FE00:0000 0040
FE00:0000 41 77 61 72 64 20 53 6F-66 74 77 61 72 65 49 42 Award SoftwareIB
FE00:0010 4D 20 43 4F 4D 50 41 54-49 42 4C 45 20 34 38 36 M COMPATIBLE 486
FE00:0020 20 42 49 4F 53 20 43 4F-50 59 52 49 47 48 54 20 BIOS COPYRIGHT
FE00:0030 41 77 61 72 64 20 53 6F-66 74 77 61 72 65 20 49 Award Software I
-q
Times have changed, indeed, but in fact you could perhaps still do PEEKs and POKEs with the good old Motorola 68k family, because they, like the 6502, used memory-mapped I/O.
I could be wrong, but I think computers today have largely abandoned memory-mapped I/O. Instead they'll do something like the Intel 80x86 family's separate I/O port space. It's been a while since I took 8086 assembly, though.

Array manipulation in Perl

The Scenario is as follows:
I have a dynamically changing text file which I'm passing to a variable to capture a pattern that occurs throughout the file. It looks something like this:
my @array1;
my $file = `cat <file_name>.txt`;
if ( @array1 = ( $file =~ m/<pattern_match>/g ) ) {
    print "@array1\n";
}
The array looks something like this:
10:38:49 788 56 51 56 61 56 59 56 51 56 80 56 83 56 50 45 42 45 50 45 50 45 43 45 54 10:38:51 788 56 51 56 61 56 59 56 51 56 80 56 83 56 50 45 42 45 50 45 50 45 43 45 54
From the above array1 output, the pattern of the array is something like this:
T1 P1 t1(1) t1(2)...t1(25) T2 P2 t2(1) t2(2)...t2(25) so on and so forth
Currently, /g in the regex returns a set of values that occur only twice (only because the txt file contains this pattern that number of times). This particular pattern occurrence will change depending on the file name that I plan to pass dynamically.
What I intend to achieve:
The final result should be a csv file that contains these values in the following format:
T1,P1,t1(1),t1(2),...,t1(25)
T2,P2,t2(1),t2(2),...,t2(25)
so on and so forth
For instance: My final CSV file should look like this:
10:38:49,788,56,51,56,61,56,59,56,51,56,80,56,83,56,50,45,42,45,50,45,50,45,43,45,54
10:38:51,788,56,51,56,61,56,59,56,51,56,80,56,83,56,50,45,42,45,50,45,50,45,43,45,54
The delimiter for this pattern is T1 which is time in the format \d\d:\d\d:\d\d
Example: 10:38:49, 10:38:51 etc
What I have tried so far:
use Data::Dumper;
use List::MoreUtils qw(part);
my $partitions = 2;
my $i = 0;
print Dumper part { $partitions * $i++ / @array1 } @array1;
In this particular case, my $partitions = 2; works because the pattern occurs in the txt file only twice, and hence I'm splitting the array into two. However, as mentioned earlier, the number of pattern occurrences keeps changing according to the txt file I use.
The Question:
How can I make this code more generic to achieve my final goal of splitting the array into multiple equal sized arrays without losing the contents of the original array, and then converting these mini-arrays into one single CSV file?
If there is any other workaround for this other than array manipulation, please do let me know.
Thanks in advance.
PS: I considered Hash of Hashes and Array of Hashes, but that kind of data structure did not seem to be a healthy solution for the problem I'm facing right now.
As far as I can tell, all you need is splice, which will work fine as long as you know the record size and it's constant
The data you showed has 52 fields, but the description of it requires 27 fields per record. It looks like each line has T, P, and t1 .. t24, rather than ending at t25
Here's how it looks if I split the data into 26-element chunks
use strict;
use warnings 'all';
my @data = qw/
10:38:49 788 56 51 56 61 56 59 56 51 56 80 56 83 56 50 45 42 45 50 45 50 45 43 45 54 10:38:51 788 56 51 56 61 56 59 56 51 56 80 56 83 56 50 45 42 45 50 45 50 45 43 45 54
/;
while ( @data ) {
    my @set = splice @data, 0, 26;
    print join(',', @set), "\n";
}
output
10:38:49,788,56,51,56,61,56,59,56,51,56,80,56,83,56,50,45,42,45,50,45,50,45,43,45,54
10:38:51,788,56,51,56,61,56,59,56,51,56,80,56,83,56,50,45,42,45,50,45,50,45,43,45,54
If you wanted to use List::MoreUtils instead of splice, the natatime function returns an iterator that will do the same thing as the splice above
Like this
use List::MoreUtils qw/ natatime /;
my $iter = natatime 26, @data;
while ( my @set = $iter->() ) {
    print join(',', @set), "\n";
}
The output is identical to that of the program above
Note
It is very wrong to start a new shell process just to use cat to read a file. The standard method is to undefine the input record separator $/ like this
my $file = do {
    open my $fh, '<', '<file_name>.txt' or die "Unable to open file for input: $!";
    local $/;
    <$fh>;
};
Or if you prefer you could use File::Slurper like this
use File::Slurper qw/ read_binary /;
my $file = read_binary '<file_name>.txt';
although you will probably have to install it as it is not a core module

comparing multiple column files using python3

input_file1:
a 1 33
a 34 67
a 68 78
b 1 99
b 100 140
c 1 70
c 71 100
c 101 190
input_file2:
a 5 23
a 30 72
a 76 78
b 5 30
c 23 88
c 92 98
I want to compare these two files such that, for every key (e.g. 'a') in file2, its two integers (the boundaries) either fall within one of the ranges recorded for the same key in file1 or fall between two of those ranges.
Instead of storing values like 'a 1 33', you can write them as a single delimited structure (like 'a:1:33') when creating the file, so the data becomes easier to read back.
Then you can read each line, split it on the ':' separator, and compare it against the other file easily.
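A minimal sketch of that comparison in Python (the file names are assumed, it splits on whitespace to match the sample data shown above, and it checks only the simple containment case):

from collections import defaultdict

def load_ranges(path):
    # Map each key ('a', 'b', ...) to its list of (low, high) ranges
    ranges = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            key, lo, hi = line.split()
            ranges[key].append((int(lo), int(hi)))
    return ranges

file1 = load_ranges('input_file1.txt')
file2 = load_ranges('input_file2.txt')

for key, pairs in sorted(file2.items()):
    for lo, hi in pairs:
        inside = any(a <= lo and hi <= b for a, b in file1[key])
        print(key, lo, hi, 'inside one file1 range' if inside else 'spans ranges')

For the sample data, 'a 5 23' falls inside 'a 1 33', while 'a 30 72' spans two of the ranges for 'a' in file1.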

Is there a pattern in these bitshifts?

I have some Nikon raw files (.nef) that were rendered useless during a USB transfer. The file size seems fine, however, and only a handful of bytes are off, each by -0x20 (hex) or -32 (dec).
Some of the files could later be recovered with another computer from the same card, and now I am searching for a solution to recover the other >100 files, which have the same error.
Is there a regular pattern? The offsets seem to be in intervals of 0x800 (2048 in dec).
Differences between the two files
1. /_LXA9414.dump: 13.703.892 bytes
2. /_LXA9414_broken.dump: 13.703.892 bytes
Offsets: hexadec.
84C00: 23 03
13CC00: B1 91
2FA400: 72 52
370400: 25 05
4B9400: AE 8E
641400: 36 16
701400: FC DC
75B400: 27 07
925400: BE 9E
A04C00: A8 88
AC2400: 2F 0F
11 difference(s) found.
Here are more diffs from other files:
http://pastebin.com/9uB3Hx43
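If you still have at least one recovered file alongside its broken twin, a small Python sketch can confirm the pattern across the whole set (the paths below are the file names from the diff above):

def diff_offsets(good_path, bad_path):
    # Return (offset, good_byte, bad_byte) for every differing byte
    with open(good_path, 'rb') as g, open(bad_path, 'rb') as b:
        good, bad = g.read(), b.read()
    assert len(good) == len(bad), 'file sizes differ'
    return [(i, good[i], bad[i]) for i in range(len(good)) if good[i] != bad[i]]

for offset, expected, actual in diff_offsets('_LXA9414.dump', '_LXA9414_broken.dump'):
    # If the pattern holds, actual == expected - 0x20, i.e. bit 5 was cleared
    print('%06X: %02X %02X' % (offset, expected, actual))

If every difference really is a single byte with bit 5 cleared, at offsets that all sit 0x400 past a multiple of 0x800, that narrows the repair down considerably; the hard part for the files with no intact copy is deciding which of those candidate offsets were actually hit.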
