I have a table in Cassandra, in which one column is a BLOB.
I wish to update only some values in that blob. Is that possible?
Example:
The string form of the BLOB is, let's say:
{"name":"ABC","rollNum": "1234"}
I want to make it:
{"name":"ABC","rollNum": "1333"} with a CQL update query.
Originally this column gets updated from my Java code, where I send a byte[] to be inserted into this BLOB column.
Now, I want to update just some fields without doing any type of select on this row.
You can't do this in general.
Cassandra, like any other database, does not know how to interpret your blob. You will need to read, parse, update, and save your blob again. Alternatively:
- use a map instead (see the sketch below)
- use individual fields, which will give the best performance
Apart from that, updates like the one you want can be achieved in document databases such as MongoDB.
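For illustration, a minimal CQL sketch of the map alternative; the students table and its columns are hypothetical names, not from the original post. A single map entry can be set without reading the row first:

-- hypothetical table; the metadata lives in a map<text, text> instead of a blob
CREATE TABLE students (
    id   text PRIMARY KEY,
    info map<text, text>
);

INSERT INTO students (id, info)
VALUES ('1234', {'name': 'ABC', 'rollNum': '1234'});

-- update a single map entry in place, no SELECT needed
UPDATE students SET info['rollNum'] = '1333' WHERE id = '1234';

Individual columns (e.g. name text, rollnum text) behave the same way on UPDATE and avoid the map's per-entry overhead.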
Related
I am trying to set up an ELT pipeline into Snowflake, and it involves a transformation after loading.
This transformation will currently create or replace a Table using data queried from a source table in Snowflake after performing some manipulations of JSON data.
My question is: is this the proper way of doing it (CREATE OR REPLACE TABLE every time the transformation runs), or is there a way to update the data in the transformed table incrementally?
Any advice would be greatly appreciated!
Thanks!
You can insert into the load (source) table and put a stream on it; the stream then tells you which rows, or ranges of rows, need to be "reviewed", and you upsert those into the output (transformed) table.
That is, if you are doing something like daily aggregates and this batch contains data for the last four days, you read only the last four days of data from the source (sparing a full read), then aggregate and upsert via the MERGE command. With this model you save on reads, aggregation, and writes.
We have also used high-water-mark tables to record the last data seen and/or the lowest value in the current batch.
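A hedged Snowflake SQL sketch of that stream-plus-MERGE pattern; raw_events, raw_events_stream, daily_agg, and their columns are all hypothetical names:

-- the stream records which rows arrived since it was last consumed
CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events;

-- re-aggregate only the days touched by the new batch and upsert them into the transformed table
MERGE INTO daily_agg t
USING (
    SELECT event_date, COUNT(*) AS event_count
    FROM raw_events
    WHERE event_date IN (SELECT DISTINCT event_date FROM raw_events_stream)
    GROUP BY event_date
) s
ON t.event_date = s.event_date
WHEN MATCHED THEN UPDATE SET t.event_count = s.event_count
WHEN NOT MATCHED THEN INSERT (event_date, event_count) VALUES (s.event_date, s.event_count);

Because the MERGE reads the stream inside a DML statement, the stream's offset should advance when the transaction commits, so the next run sees only newly loaded rows.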
The "requirements" are something like this: as a user I want to upload a dataset in CSV, JSON, or other supported text formats and be able to do basic REST queries against it such as selecting all first names in the dataset or select the first 10 rows.
I'm struggling to think of the "best" way to store this data. While I don't think, off the bat, that this will generate millions of datasets, it seems generally bad to create a new table for every user's dataset, since eventually I would hit the inode limit. I could store the datasets as flat files in something like S3 with caching, but that still requires opening and parsing the file to query it.
Is this a use case for the JSON type in Postgres? If not, what would be the "right" format and place to store this data?
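For context, a minimal sketch of what the JSONB approach could look like in Postgres; the table, columns, and the first_name key are hypothetical, not from the original post:

-- one row per uploaded record, the payload kept as JSONB
CREATE TABLE dataset_rows (
    dataset_id bigint  NOT NULL,
    row_num    integer NOT NULL,
    row_data   jsonb   NOT NULL,
    PRIMARY KEY (dataset_id, row_num)
);

-- "select all first names in the dataset" (assumes the uploaded data has a first_name key)
SELECT row_data->>'first_name' FROM dataset_rows WHERE dataset_id = 42;

-- "select the first 10 rows"
SELECT row_data FROM dataset_rows WHERE dataset_id = 42 ORDER BY row_num LIMIT 10;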
I have successfully downloaded Twitter data through Flume directly into an HBase table containing one column family, and all of the data is stored in one column like this:
hbase(main):005:0> scan 'tweet'
ROW                                           COLUMN+CELL
 default00fbf898-6f6e-4b41-aee8-646efadfba46  column=data:pCol, timestamp=1454394077534, value={"extended_entities":{"media":[{"display_url":"pic.twitter.com/a7Mjq2daKZ","source_user_id":2987221847,"type":"photo"....
Now I want to access structs and arrays through HBase the way we can access them in Hive. I have tried googling the issue but am still clueless. Kindly help.
You can't query display_url, source_user_id, or other JSON fields in HBase directly. You should use a document-store NoSQL database like MongoDB.
I receive new data files every day. Right now, I'm building the database with all the required tables to import the data and perform the required calculations.
Should I just append each new day's data to my current tables? Each file contains a date column, which would allow for a "WHERE" query in the future if I need to analyze data for one particular day. Or should I be creating a new set of tables for every day?
I'm new to database design (coming from Excel). I will be using SQL Server for this.
Assuming that the structure of the data being received is the same, you should only need one set of tables rather than creating new tables each day.
I'd recommend storing the value of the date column from your incoming data in your database, and also having a 'CreateDate' column in your tables, with a default value of 'GetDate()' so that it automatically gets populated with the current date when the row is inserted.
You may also want to have another column to store the data filename that the row was imported from, but if you're already storing the value of the date column and the date that the row was inserted, this shouldn't really be necessary.
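A minimal T-SQL sketch of that layout; the table and column names (DailyData, DataDate, SomeMeasure, SourceFile) are hypothetical stand-ins:

-- one set of tables, rows appended daily
CREATE TABLE dbo.DailyData (
    Id          int IDENTITY(1,1) PRIMARY KEY,
    DataDate    date NOT NULL,                        -- the date column from the incoming file
    SomeMeasure decimal(18, 2) NULL,                  -- stand-in for the real data columns
    SourceFile  nvarchar(260) NULL,                   -- optional: which file the row came from
    CreateDate  datetime NOT NULL DEFAULT GETDATE()   -- populated automatically on insert
);

-- later, analyze one particular day
SELECT * FROM dbo.DailyData WHERE DataDate = '2016-02-01';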
In the past, when doing this type of activity using a custom data-loader application, I've also found it useful to create log files recording success/error/warning messages, including some kind of unique key for the source data and the target database; i.e. if coming from an Excel file and going into a database column, you could store the row index from Excel and the primary key of the inserted row. This helps with tracking down any problems later on.
You might want to consider having a look at SSIS (SQL Server Integration Services). It's the SQL Server tool for ETL activities.
- Yes, append each day's data to the tables; one set of tables for all data.
- Yes, use a date column to identify the day that the data was loaded.
- Maybe have another table with a date column and a CLOB column: the date to contain the load date, and the CLOB to contain the file that you imported.
Good question. You most definitely should have a single set of tables and append the data daily. Consider this: if you create a new set of tables each day, what would, say, a monthly report query look like? A quarterly report query? It would be a mess, with UNIONs and JOINs all over the place.
A single set of tables with a WHERE clause makes the querying and reporting manageable.
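For example, a monthly or quarterly rollup against a single table is just a date filter and a GROUP BY; this sketch reuses the hypothetical DailyData table from the example above:

-- monthly totals for one quarter, driven entirely by the WHERE clause
SELECT
    DATEFROMPARTS(YEAR(DataDate), MONTH(DataDate), 1) AS ReportMonth,
    SUM(SomeMeasure) AS TotalMeasure
FROM dbo.DailyData
WHERE DataDate >= '2016-01-01' AND DataDate < '2016-04-01'
GROUP BY DATEFROMPARTS(YEAR(DataDate), MONTH(DataDate), 1)
ORDER BY ReportMonth;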
You might do a little reading on relational database theory. Wikipedia is a good place to start. The basics are pretty straightforward if you have the knack for it.
I would have the data load into a staging table regardless, and append to the main tables after that. Once a week I would then refresh all data in the main table to ensure that it remains correct as per the source.
Marcus
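A hedged sketch of that staging pattern, again with hypothetical table names and a hypothetical file path; the stage table is assumed to mirror the CSV layout:

-- load the day's file into an empty staging table, then append to the main table
TRUNCATE TABLE dbo.DailyData_Stage;

BULK INSERT dbo.DailyData_Stage
FROM 'C:\imports\data_2016-02-01.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

INSERT INTO dbo.DailyData (DataDate, SomeMeasure, SourceFile)
SELECT DataDate, SomeMeasure, 'data_2016-02-01.csv'
FROM dbo.DailyData_Stage;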
I have a schema with 10 fields. One of the fields is text (the content of a file); all the other fields are custom metadata. The document doesn't change, but the metadata changes frequently.
Is there any way to skip the document text while re-indexing? Can I index only the custom metadata? If I skip the document text when re-indexing, does it update the index by removing the text field from the indexed document?
To my knowledge there's no way to selectively update specific fields. An update operation performs a complete replace of all document data. Since Solr is open source, it's possible that you could produce your own component for this if really desired.