HBase arrays and Hive arrays

I'm trying to write arrays into HBase using Hadoop's ArrayWritable class. I serialize a List<String> into an ArrayWritable, then use WritableUtils.toByteArray to get the bytes; when reading, I reconstruct the List from the ArrayWritable. I've added the code for this below, and I can correctly read and write the List to HBase from Java.
However, when we create an external Hive table on top of the written data, the stored arrays are not visible in the expected format. The list of strings appears as one concatenated string without any separator, and certainly not as an array of strings.
PROBLEM: We need the HBase arrays to be visible in Hive in a usable format. We'll also be exporting this data to Redshift for querying, so the encoding needs to be workable there too. How can I modify my read/write approach so that it works for both the Java application and Hive/Redshift?
Methods used for serialization and deserialization:
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Wrap each String in a Text and pack them all into an ArrayWritable.
public static ArrayWritable getWritableFromArray(List<String> stringList) {
    Writable[] writables = new Writable[stringList.size()];
    for (int i = 0; i < writables.length; i++) {
        writables[i] = new Text(stringList.get(i));
    }
    return new ArrayWritable(Text.class, writables);
}

// Unpack the ArrayWritable back into a List<String>.
public static List<String> getListFromWritable(ArrayWritable arrayWritable) {
    Writable[] writables = arrayWritable.get();
    List<String> list = new ArrayList<>(writables.length);
    for (Writable writable : writables) {
        list.add(((Text) writable).toString());
    }
    return list;
}
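For context, the write path looks roughly like this. This is only a sketch: the table handle, row-key layout, and method name are assumptions (the d:city column matches the Hive mapping below), but WritableUtils.toByteArray is the call mentioned above.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.WritableUtils;

// Sketch of the write path described above (table/column names assumed).
// ArrayWritable.write emits an element count followed by length-prefixed
// strings, which is the byte pattern visible in the Hive output below.
public static void writeCities(Table table, long uid, List<String> cities) throws IOException {
    byte[] value = WritableUtils.toByteArray(getWritableFromArray(cities));
    Put put = new Put(Bytes.toBytes(uid));
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("city"), value);
    table.put(put);
}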
Statement used to create the Hive table:
create external table testdata (
    uid bigint,
    city array<string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:city")
TBLPROPERTIES ("hbase.table.name" = "testdata");
Querying the Hive table returns this:
select * from testdata;
OK
23821975838576221 ["\u0000\u0000\u0000\u0001\u0006raipur\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"]
808554262192775221 ["\u0000\u0000\u0000","\u0006indore\u0006raipur\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"]
2361689275801875221 ["\u0000\u0000\u0000","\u0006indore\u0006raipur\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"]
4897772799782875221 ["\u0000\u0000\u0000","\nchandigarh\u0006indore\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"]
When this data is exported to Redshift, the cities appear concatenated, as:
indoreraipur
chandigarhindore
How can I fix this? Is trying to write a List directly into HBase a bad idea? Should I try to manually serialize and deserialize it and write it as a String instead of an Array type?
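One workable direction, sketched below (an assumption-laden sketch, not the poster's code): skip Writable serialization for this column and store the list as a single delimited UTF-8 string. Hive can then split the value into an array, and Redshift receives a plain string it can split as well. The delimiter \u0002 is Hive's default collection-item separator, but that is an assumption to verify against your table's SerDe settings.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: encode the list as one delimited string instead of Writable bytes.
// '\u0002' (Hive's default collection-item delimiter) is an assumption to
// verify; it must also never occur inside the stored strings themselves.
public static byte[] toDelimitedBytes(List<String> cities) {
    return String.join("\u0002", cities).getBytes(StandardCharsets.UTF_8);
}

public static List<String> fromDelimitedBytes(byte[] bytes) {
    String raw = new String(bytes, StandardCharsets.UTF_8);
    return raw.isEmpty()
            ? new ArrayList<>()
            : new ArrayList<>(Arrays.asList(raw.split("\u0002", -1)));
}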

Related

Is there a way to map a postgres array to a list without creating an additional table?

I'm starting my application by creating the database model, and I cannot work out whether there is any way in Hibernate/JPA to save a List of strings as an array.
If I simply try to save my
List<String> someLines;
as the SQL column
some_lines text[]
I get an error that array cannot be represented as a List.
If I put an @ElementCollection annotation like
@ElementCollection
List<String> someLines;
another table some_elements is expected. The same happens if I additionally add @Column:
@Column(name = "some_lines")
@ElementCollection
List<String> someLines;
How do you usually save a simple List in the DB (without creating an additional table)?
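One common way to do it, sketched here under the assumption that a plain text column (rather than a real text[] array) is acceptable: a JPA 2.1 AttributeConverter that joins the list into one string. The ';' delimiter is an arbitrary choice and must not occur inside the stored strings.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.persistence.AttributeConverter;
import javax.persistence.Converter;

// Sketch: persist List<String> in a single text column, avoiding the extra
// table that @ElementCollection creates.
@Converter
public class StringListConverter implements AttributeConverter<List<String>, String> {

    @Override
    public String convertToDatabaseColumn(List<String> attribute) {
        return attribute == null ? null : String.join(";", attribute);
    }

    @Override
    public List<String> convertToEntityAttribute(String dbData) {
        return dbData == null
                ? new ArrayList<>()
                : new ArrayList<>(Arrays.asList(dbData.split(";", -1)));
    }
}

The field would then be annotated with @Convert(converter = StringListConverter.class) on a plain @Column(name = "some_lines"); note that the column type is then text, not a real text[].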

OrientDB - find "orphaned" binary records

I have some images stored in the default cluster of my OrientDB database. I stored them by following the code given in the documentation for large content split across multiple ORecordBytes: http://orientdb.com/docs/2.1/Binary-Data.html
So I have two types of records in my default cluster: the binary data itself, and ODocuments whose 'data' field points to the individual binary-data records.
Some of the ODocument records' RIDs are referenced from other classes, but the remaining records are orphaned, and I would like to be able to retrieve them.
My idea was to use
select from cluster:default where #rid not in (select myField from MyClass)
But the problem is that this also returns the raw binary-data records, and I only want the records with the 'data' field.
Besides, I'd prefer a cleaner query, because I don't think the "not in" clause is something that should be encouraged. Is there something like a JOIN which returns records that are not joined to anything?
Can you help me please?
To resolve my problem, I did the following. However, I don't know whether it is the right (most optimized) way to do it.
I used the following SQL request:
SELECT rid FROM (FIND REFERENCES (SELECT FROM CLUSTER:default)) WHERE referredBy = []
In Java, I execute it with an OCommandSQL/OCommandRequest pair and retrieve an OrientDynaElementIterable. I iterate over it to get an OrientVertex, contained in another OrientVertex, from which I retrieve the RID of the orphan.
Here is some code in case it helps someone, assuming that you have an OrientGraphNoTx or an OrientGraph in the 'graph' variable :)
String cmd = "SELECT rid FROM (FIND REFERENCES (SELECT FROM CLUSTER:default)) WHERE referredBy = []";
List<String> orphanedRid = new ArrayList<String>();

OCommandRequest request = graph.command(new OCommandSQL(cmd));
OrientDynaElementIterable objects = request.execute();

// Each result is a vertex whose 'rid' property holds the orphaned record.
Iterator<Object> iterator = objects.iterator();
while (iterator.hasNext()) {
    OrientVertex obj = (OrientVertex) iterator.next();
    OrientVertex orphan = obj.getProperty("rid");
    orphanedRid.add(orphan.getIdentity().toString());
}

Iterate over big resultset in batches (like foreach, but grouped)

I am using ScalikeJDBC to fetch a large table, convert the rows to JSON, and then call a web service with 50 JSON objects (rows) at a time. This is my code:
val rows = sql"SELECT * FROM bigtable"
val jsons = rows.map { row =>
  // build a JSON object for each row
}.toList().apply()

jsons.grouped(50).foreach { batch =>
  // send 50 objects at once to an HTTP server
}
This works, but unfortunately the intermediate list is huge and consumes a lot of memory. I am looking for a way to iterate over the result set lazily, similar to foreach, except that I want to iterate over batches of 50 rows. Is that possible with ScalikeJDBC?
I worked around the memory issue by filling and clearing a mutable list instead of using grouped, but I am still looking for a better solution.
Try specifying fetchSize.
See also: http://scalikejdbc.org/documentation/operations.html#setting-jdbc-fetchsize
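For reference, ScalikeJDBC's fetchSize is a hint that ends up in JDBC's Statement.setFetchSize. Below is a plain-JDBC sketch of the same streaming-plus-batching idea; toJson and send are assumed placeholders, and note that the PostgreSQL driver additionally requires autoCommit to be off before fetchSize actually streams.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch: stream rows with a driver-side fetch size and flush every 50 rows,
// so no full intermediate list is ever built.
static void streamInBatches(Connection conn) throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM bigtable")) {
        ps.setFetchSize(50); // ask the driver to fetch rows incrementally
        try (ResultSet rs = ps.executeQuery()) {
            List<String> batch = new ArrayList<>(50);
            while (rs.next()) {
                batch.add(toJson(rs));
                if (batch.size() == 50) {
                    send(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                send(batch); // flush the final partial batch
            }
        }
    }
}

// Placeholder stubs for the row-to-JSON and HTTP steps (assumptions).
static String toJson(ResultSet rs) throws SQLException { return "{}"; }
static void send(List<String> batch) { /* HTTP POST */ }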

Store strings of arbitrary length in PostgreSQL

I have a Spring application which uses JPA (Hibernate), initially created with Spring Roo. I need to store strings of arbitrary length, so I've annotated the field with @Lob:
public class MyEntity {

    @NotNull
    @Size(min = 2)
    @Lob
    private String message;

    ...
}
The application works fine on localhost, but I've deployed it to an external server and an encoding problem has appeared. For that reason I'd like to check whether the data stored in the PostgreSQL database is correct. The application creates/updates the tables automatically, and for that field (message) it has created a column of type:
text NOT NULL
The problem is that after storing data, if I browse the table or just SELECT that column, I can't see the text, only numbers. Those numbers seem to be identifiers to "somewhere" where the information is stored.
Can anyone tell me exactly what these identifiers are, and whether there is any way to see the data stored in a @Lob column from pgAdmin or a SELECT clause?
Is there any better way to store strings of arbitrary length in JPA?
Thanks.
I would recommend skipping the @Lob annotation and using columnDefinition like this:
#Column(columnDefinition="TEXT")
See if that helps with viewing the data while browsing the database itself.
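For instance, applied to the field from the question (a sketch):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.validation.constraints.NotNull;
import javax.validation.constraints.Size;

@Entity
public class MyEntity {

    @Id
    private Long id;

    // Mapped as an inline TEXT column rather than a large object, so the
    // value is readable directly in pgAdmin or with a plain SELECT.
    @NotNull
    @Size(min = 2)
    @Column(columnDefinition = "TEXT")
    private String message;
}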
Use the @Lob definition; it is correct. The column stores an OID pointing into the pg_largeobject catalog table (in pgAdmin: catalogs -> PostgreSQL -> tables -> pg_largeobject).
The binary data is stored there efficiently, and JPA will correctly fetch and store the data for you, with this as an implementation detail.
Old question, but here is what I found when I encountered this:
http://www.solewing.org/blog/2015/08/hibernate-postgresql-and-lob-string/
Relevant parts below.
@Entity
@Table(name = "note")
@Access(AccessType.FIELD)
class NoteEntity {

    @Id
    private Long id;

    @Lob
    @Column(name = "note_text")
    private String noteText;

    public NoteEntity() { }

    public NoteEntity(String noteText) { this.noteText = noteText; }
}
Hibernate's PostgreSQL9Dialect stores @Lob String attribute values by explicitly creating a large-object instance and then storing the OID of that object in the column associated with the attribute. So the text of our notes isn't really in the column; it lives in a separate large object, and the column only holds its OID. If we use some of PostgreSQL's large-object functions, we can retrieve the text itself.
Use this to query:
SELECT id,
       convert_from(
           loread(lo_open(note_text::int, x'40000'::int), x'40000'::int),
           'UTF-8') AS note_text
FROM note;
-- x'40000' (262144) is the INV_READ flag for lo_open; as loread's second
-- argument it simply serves as a generous maximum byte count.

Select from table using XML column

I am creating a task scheduler on SQL Server 2008.
I have a table that I use to store tasks. Each task has a task name (e.g. ImportFile) and arguments. I store the arguments in an XML column, since different tasks have different signatures.
The table is as follows:
Id:integer(PK) | operation:nvarchar | Arguments:xml
Before queuing a task, I often need to verify that the given task hasn't been scheduled yet. The lookup is done on both operation and arguments.
Question: Using LINQ-to-SQL, how can I check whether a given operation + arguments combination is already present in the queue?
I am looking for something like:
var isTaskScheduled = db.Tasks.Any(t =>
    t.Operation == task.Operation &&
    t.Arguments == task.ArgumentsAsXElement);
(which doesn't work because SQL Server can't compare the XML type)
Any alternative implementation suggestions?
You might want to surface e.g. a string property that encapsulates your Arguments, or maybe it would be sufficient to have e.g. the length and a CRC of your Arguments as extra properties on your class:
public partial class Task
{
    public int ArgumentLength
    { .... }

    public int ArgumentCRC
    { .... }
}
That way, if you compare the length (of your XML) and the CRC and they match, you can be pretty safe in assuming the two XMLs are identical. Your check would then be something like:
var isTaskScheduled =
    db.Tasks.Any(t => t.Operation == task.Operation &&
                      t.ArgumentLength == task.ArgumentLength &&
                      t.ArgumentCRC == task.ArgumentCRC);
or something like that.
This may be a stretch, but you could store a hash code when saving the data to the database, then query on that hash-code value at a later date/time.
This assumes that you have a class that represents your task entity and that you have overridden the GetHashCode method of said class.
Now, when you query the database to see whether the task is in the scheduled queue, you simply query on the hash code, thus avoiding the need to do any XML poking at query time.
var t1 = new Task{Operation="Run", Arguments="someXElement.value"};
var t2 = new Task{Operation="Run", Arguments="someXElement.value"};
In the code above, t1 and t2 produce the same hash code because you are overriding GetHashCode and computing the hash of Operation + Arguments.Value. If you store the hash code in the DB, then you can easily tell whether the DB holds an object whose hash code equals the one you are checking for.
This may be similar to what marc_s was talking about.
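For illustration, here is the shape of that override. The question itself is C#, but the idea is identical in Java syntax; the fields and the choice of hash inputs below are assumptions.

import java.util.Objects;

// Sketch: a task whose hash combines operation and serialized arguments.
// Persist hashCode() in its own indexed column and query on it; only on a
// hash match do the full values need comparing (hashes can collide).
public class Task {
    private final String operation;
    private final String arguments; // the XML serialized to a string

    public Task(String operation, String arguments) {
        this.operation = operation;
        this.arguments = arguments;
    }

    @Override
    public int hashCode() {
        return Objects.hash(operation, arguments);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Task)) return false;
        Task other = (Task) o;
        return operation.equals(other.operation) && arguments.equals(other.arguments);
    }
}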
You can write a class which implements IComparable:
public class XMLArgument : IComparable
{
    public XMLArgument(string argument)
    {
    }

    public int CompareTo(object obj)
    {
        ...
    }
}
var isTaskScheduled = db.Tasks.Any(t =>
    t.Operation == task.Operation &&
    (new XMLArgument(t.Arguments)).CompareTo(new XMLArgument(task.ArgumentsAsXElement)) == 0);
