Flink-SQL: Extract values from nested objects - apache-flink

i'm using Flink SQL and the following scheme shows my source data (belonging to some Twitter data):
CREATE TABLE `twitter_raw` (
`entities` ROW(
`hashtags` ROW(
`text` STRING,
`indices` INT ARRAY
) ARRAY,
`urls` ROW(
`indices` INT ARRAY,
`url` STRING,
`display_url` STRING,
`expanded_url` STRING
) ARRAY,
`user_mentions` ROW(
`screen_name` STRING,
`name` STRING,
`id` BIGINT
) ARRAY
)
)
WITH (...);
I want to get only the hashtags in a collection. Therefore i have to map the collection of constructed objects (ROW) to an array of STRING.
Like this scheme:
CREATE TABLE `twitter_raw` (
`entities` ROW(
`hashtags` STRING ARRAY,
`urls` STRING ARRAY,
`user_mentions` STRING ARRAY
)
)
WITH (...);
How can i achieve this with Flink-SQL? Maybe built-in functions (JSON-functions?) or own UDF or do i have to write a DataStream Job?
Thanks in advance.

The SQL command UNNEST helps in this case. It is like EXPLODE in Spark.
You can solve it by creating a new row for each hashtag in the hashtags array:
SELECT hashtag, index
FROM twitter_raw
CROSS JOIN UNNEST(hashtags) AS t (hashtag, index)

You can either define a computed row, or a VIEW, and then extracting the hashtags field using the dot notation. e.g.:
CREATE VIEW hashtags_raw (hashtags) AS
SELECT entities.hashtags AS hashtags FROM twitter_raw

Related

How to Query in MongoDB using one Key in c

I am attempting to search for a key's value in MongoDB via my .c program. essentially In my collection in one of my documents: I have key:value, i want to be able to return said value by passing in the key. I have seen
query = bson_new ();
BSON_APPEND_UTF8 (query, "hello", "world");
cursor = mongoc_collection_find_with_opts( collection, query, NULL, NULL);
I want to be able to search using only hello and return world.
To query a value in MongoDB, with only one key and pair (ex. Key:Value). Regex can be used.
bson_t *query;
bson_append_regex(query, key, -1/* length of key*/, ".", NULL);
cursor = mongoc_collection_find_with_opts(collection, query, NULL, NULL);
The string "." in regex signifies 'any string', thus if you would need to find json value within another key such as:
{ key:
{
second_key:value
}
}
Changing the Regex to "second_key." would locate desired value.
Note
This will return any other values that may lay within, paired with mongoc_cursor_t * you can select the desired value you are looking for.

How to concatenate the elements of int array to string in Hive

I'm trying to concatenate the element of int array to one string in hive.
The function concat_ws works only for string arrays, so I tried cast(my_int_array as string) but it's not working.
Any suggestion?
Try to transform using /bin/cat:
from mytable select transform(my_int_array) using '/bin/cat' as (my_int_array);
Second option is to alter table and replace delimiters:
1) ALTER TABLE mytable CHANGE COLUMN my_int_array = my_int_array_string string;
2) SELECT REPLACE(my_int_array_string, '\002', ', ') FROM mytable;
It seems that the easiest way is to write a custom UDF to perform this specific task:
public class ConcatIntArray extends UDF {
public String evaluate(ArrayList<Integer> in, final String delimiter){
return in.stream().map(u-> String.valueOf(u)).collect(Collectors.joining(delimiter));
}
}

How to pass Array[Long] ids into Aerospike Record UDF?

I store many sub records as Map in the same top record, now I need remove some sub records according to passing keys, I plan to use UDF to implement it. When I invoked from Java, the log shows the length is 0. Why? and how to correct this? Thanks!
In Java =>
val ids = Array(1,2,3)
Value.get(ids)
In Lua UDF =>
function remove_keys(rec, binName, rmkeys)
info("Value(%s) valType(%s)", tostring(rec), type(rec));
info("Value(%s) valType(%s)", tostring(binName), type(binName));
info("Value(%s) valType(%s)", tostring(rmkeys), type(rmkeys));
info("BinName(%s)", binName)
info("len(%s)", tostring(#rmkeys))
...
end
Try:
String binName = "binName";
List<Value> rmKeys = new ArrayList<Value>();
rmKeys.add(Value.get("key1"));
client.execute(policy, key, packageName, "remove_keys", Value.get(binName), Value.get(rmKeys));

Dapper Results(Dapper Row) with Bracket Notation

According to the Dapper documentation, you can get a dynamic list back from dapper using below code :
var rows = connection.Query("select 1 A, 2 B union all select 3, 4");
((int)rows[0].A)
.IsEqualTo(1);
((int)rows[0].B)
.IsEqualTo(2);
((int)rows[1].A)
.IsEqualTo(3);
((int)rows[1].B)
.IsEqualTo(4);
What is however the use of dynamic if you have to know the field names and datatypes of the fields.
If I have :
var result = Db.Query("Select * from Data.Tables");
I want to be able to do the following :
Get a list of the field names and data types returned.
Iterate over it using the field names and get back data in the following ways :
result.Fields
["Id", "Description"]
result[0].values
[1, "This is the description"]
This would allow me to get
result[0].["Id"].Value
which will give results 1 and be of type e.g. Int 32
result[0].["Id"].Type --- what datattype is the value returned
result[0].["Description"]
which will give results "This is the description" and will be of type string.
I see there is a results[0].table which has a dapperrow object with an array of the fieldnames and there is also a result.values which is an object[2] with the values in it, but it can not be accessed. If I add a watch to the drilled down column name, I can get the id. The automatically created watch is :
(new System.Collections.Generic.Mscorlib_CollectionDebugView<Dapper.SqlMapper.DapperRow>(result as System.Collections.Generic.List<Dapper.SqlMapper.DapperRow>)).Items[0].table.FieldNames[0] "Id" string
So I should be able to get result[0].Items[0].table.FieldNames[0] and get "Id" back.
You can cast each row to an IDictionary<string, object>, which should provide access to the names and the values. We don't explicitly track the types currently - we simply don't have a need to. If that isn't enough, consider using the dapper method that returns an IDataReader - this will provide access to the raw data, while still allowing convenient call / parameterization syntax.
For example:
var rows = ...
foreach(IDictionary<string, object> row in rows) {
Console.WriteLine("row:");
foreach(var pair in row) {
Console.WriteLine(" {0} = {1}", pair.Key, pair.Value);
}
}

Unsetting elements of a cakephp generated array

I have an array called $all_countries following this structure:
Array
(
[0] => Array
(
[countries] => Array
(
[id] => 1
[countryName] => Afghanistan
)
)
[1] => Array
(
[countries] => Array
(
[id] => 2
[countryName] => Andorra
)
)
)
I want to loop through an array called prohibited_countries and unset the entire [countries] element that has a countryName matching.
foreach($prohibited_countries as $country){
//search the $all_countries array for the prohibited country and remove it...
}
Basically I've tried using an array_search() but I can't get my head around it, and I'm pretty sure I could simplify this array beforehand using Set::extract or something?
I'd be really grateful if someone could suggest the best way of doing this, thanks.
Here's an example using array_filter:
$all_countries = ...
$prohibited_countries = array('USA', 'England'); // As an example
$new_countries = array_filter($all_countries, create_function('$record', 'global $prohibited_countries; return !in_array($record["countries"]["countryName"], $prohibited_countries);'));
$new_countries now contains the filtered array
Well first of all id e teh array in the format:
Array(
'Andorra' => 2,
'Afghanistan' => 1
);
Or if you need to have the named keys then i would do:
Array(
'Andorra' => array('countryName'=> 'Andorra', 'id'=>2),
'Afghanistan' => array('countryName'=> 'Afghanistan', 'id'=>1)
);
then i would jsut use an array_diff_keys:
// assuming the restricted and full list are in the same
// array format as outlined above:
$allowedCountries = array_diff_keys($allCountries, $restrictedCountries);
If your restricted countries are just an array of names or ids then you can use array_flip, array_keys, and/or array_fill as necessary to get the values to be the keys for the array_diff_keys operation.
You could also use array_map to do it.
Try something like this (it's probably not the most efficient way, but it should work):
for ($i = count($all_countries) - 1; $i >= 0; $i--) {
if (in_array($all_countries[$i]['countries']['countryName'], $prohibited_countries) {
unset($all_countries[$i]);
}
}
If you wanted to use the Set class included in CakePHP, you could definitely reduce the simplicity of your country array with Set::combine( array(), key, value ). This will reduce the dimensionality (however, you could do this differently as well. It looks like your country array is being created by a Cake model; you could use Model::find( 'list' ) if you don't want the multiple-dimension resultant array... but YMMV).
Anyway, to solve your core problem you should use PHP's built-in array_filter(...) function. Manual page: http://us3.php.net/manual/en/function.array-filter.php
Iterates over each value in the input
array passing them to the callback
function. If the callback function
returns true, the current value from
input is returned into the result
array. Array keys are preserved.
Basically, pass it your country array. Define a callback function that will return true if the argument passed to the callback is not on the list of banned countries.
Note: array_filter will iterate over your array, and is going to be much faster (execution time-wise) than using a for loop, as array_filter is a wrapper to an underlying C function. Most of the time in PHP, you can find a built-in to massage arrays for what you need; and it's usually a good idea to use them, just because of the speed boost.
HTH,
Travis

Resources