I have a question regarding importing a CSV file in which some of the fields are objects. I have created the types for those objects, but I don't know how to import the CSV's object data into Cassandra types.
For example, I have a house table that has id, name and pet (pet is an object). I have created a type for pet (name, age). My CSV file has two columns called pet.name and pet.age, and I want to import that data into the pet type. Am I able to do that? Sorry, I'm new to Cassandra.
Thank you
You can use the DataStax Bulk Loader tool (DSBulk) to bulk load data in CSV format to a Cassandra table. It supports loading data into columns with user-defined types (UDTs).
Here are some references with examples to help you get started quickly:
Blog - DSBulk Intro + Loading data
Blog - More DSBulk Loading examples
Blog - Counting records with DSBulk
Docs - Loading data examples
Answered questions - DS Community
DSBulk is open-source so it's free to use. Cheers!
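One common approach for your layout: combine the pet.name and pet.age columns into a single pet column holding the whole UDT as JSON, which DSBulk can parse into the type. A minimal sketch (keyspace, table and file names are hypothetical; it assumes DSBulk's default CSV settings, where \ is the escape character):

-- CQL: the type and table from your example (keyspace 'demo' is made up)
CREATE TYPE demo.pet (name text, age int);
CREATE TABLE demo.house (id int PRIMARY KEY, name text, pet frozen<pet>);

houses.csv, with pet carried as one JSON-formatted column:
id,name,pet
1,Maple Cottage,"{\"name\": \"Rex\", \"age\": 3}"

dsbulk load -url houses.csv -k demo -t house -header true

The blogs above walk through this pattern (and custom mappings via -m) in more detail.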
I am working on a project to create a simplified version of the SQLite database. I got stuck when trying to figure out how it manages to store data for multiple tables with different schemas in a single file. I suppose it must be using some indexes to map the data of the different tables. Can someone shed more light on how it's actually done? Thanks.
Edit: I suppose there is already an explanation in the docs, but I'm looking for an easier way to understand it better and faster.
The schema is the list of all entities (tables, views, etc.) in the database as a whole, rather than the database consisting of many schemas on a per-entity basis.
The data itself is stored in pages, each page being owned by an entity. It is these pages that are saved.
The default page size is 4K, so you will notice that the file size is always a multiple of 4K. You could also, with experimentation, create a database with some tables, note its size, then add some data; if the added data does not require another page, the size of the file stays the same. This demonstrates that it's all about pages rather than a linear/contiguous stream of data.
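You can check this yourself from the sqlite3 shell; a hypothetical session (the page count is illustrative):

sqlite> PRAGMA page_size;
4096
sqlite> PRAGMA page_count;
5

The file on disk will then be 5 * 4096 = 20480 bytes.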
The schema itself is saved in a table called sqlite_master. This table has the following columns :-
type (the type, e.g. table, etc.),
name (the name given to the entity),
tbl_name (the table to which the entity applies),
rootpage (the map to the entity's first page),
sql (the SQL used to generate the entity, if any)
Note that another schema, sqlite_temp_master, may also exist if there are temporary tables.
For example :-
Using SELECT * FROM sqlite_master; could result in something like :-
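(A hypothetical result for a database holding a single user table; the exact values depend on the database:)

type   name     tbl_name  rootpage  sql
-----  -------  --------  --------  --------------------------------------------
table  mytable  mytable   2         CREATE TABLE mytable (id INTEGER, data TEXT)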
See 2.6. Storage Of The SQL Database Schema in the SQLite file format documentation.
I want to export some data from OpenTSDB, then import it into DolphinDB.
In OpenTSDB, the metrics are device_id and ssid; the tags are battery_level, battery_status, battery_temperature, bssid, cpu_avg_1min, cpu_avg_5min, cpu_avg_15min, mem_free, mem_used and rssi.
In DolphinDB, I create a table as below:
COLS_READINGS = `time`device_id`battery_level`battery_status`battery_temperature`bssid`cpu_avg_1min`cpu_avg_5min`cpu_avg_15min`mem_free`mem_used`rssi`ssid
TYPES_READINGS = `DATETIME`SYMBOL`INT`SYMBOL`DOUBLE`SYMBOL`DOUBLE`DOUBLE`DOUBLE`LONG`LONG`SHORT`SYMBOL
schema_readings = table(COLS_READINGS, TYPES_READINGS)
I find that a CSV text file can be imported into DolphinDB, but I don't know how to export data to a CSV text file from OpenTSDB. Is there an easy way to do this?
Assuming you're using an HBase backend, the easiest way would be to access that directly. The OpenTSDB schema documentation describes in detail how to get the data you need.
The data is stored in one big table, but to save space, all metric names, tag keys and tag values are referenced using UIDs. These UIDs can be looked up in the UID table, which stores that mapping in both directions.
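For example, a couple of hypothetical hbase shell lookups (assuming the default UID table name 'tsdb-uid'; the forward name-to-UID mapping is kept in the 'id' column family and the reverse in 'name'):

# UID assigned to the metric name 'device_id'
get 'tsdb-uid', 'device_id', 'id:metrics'
# reverse lookup: metric name for the UID bytes \x00\x00\x01
get 'tsdb-uid', "\x00\x00\x01", 'name:metrics'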
You can write a small exporter in a language of your choice. The OpenTSDB code comes with an HBase client library, asynchbase, and has some tools to parse the raw data in its Internal class, which can make it a bit easier.
I have a spreadsheet of data that I would like to convert to 2sxc data items.
Are there any tools or suggested approaches to doing this? (I can get the data into a SQL table if that would help.)
Thanks.
The answer is that you can import and export content types and items to and from XML.
I have successfully downloaded Twitter data through Flume directly into an HBase table containing one column family, and all of the data is stored in one column like this:
hbase(main):005:0> scan 'tweet'
ROW                                           COLUMN+CELL
 default00fbf898-6f6e-4b41-aee8-646efadfba46  column=data:pCol, timestamp=1454394077534, value={"extended_entities":{"media":[{"display_url":"pic.twitter.com/a7Mjq2daKZ","source_user_id":2987221847,"type":"photo"....
Now I want to access structs and arrays through HBase like we can access them in Hive. I have tried googling the issue but am still clueless. Kindly help.
You can't query display_url, source_user_id or other JSON fields in HBase directly. You should use a document-store NoSQL DB like MongoDB.
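For instance, a minimal sketch assuming the tweet JSON has been loaded into a hypothetical collection named tweets: MongoDB reaches into nested documents and arrays with dot notation, which is exactly what HBase can't do over an opaque value:

// find tweets that carry a photo attachment
db.tweets.find({ "extended_entities.media.type": "photo" })
// project just the media display URLs
db.tweets.find({}, { "extended_entities.media.display_url": 1 })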
My job requires that I look up information on a long spreadsheet that's updated and sent to me once or twice a week. Sometimes the newest spreadsheet leaves off information that was in the last spreadsheet, causing me to have to look through several different spreadsheets to find the info I need. I recently discovered that I could convert the spreadsheet to a CSV file and then upload it to a database table. With a few lines of script, all I have to do is type in what I'm looking for and voila! Now I just got the newest spreadsheet and I'm wondering if I can just import it on top of the old one. There is a unique number for each row that I have set to primary in the database. If I try to import it on top of the current info, will it just skip the rows where the primary key would be duplicated, or would it mess up my database?
Thought I'd ask the experts before I tried it. Thanks for your input!
Details:
The spreadsheet consists of clients of ours. Each row contains the client's name, a unique ID number, their address and contact info. I can set the column containing the unique ID to primary, then upload it. My concern is that there is nothing to signify a new row in a CSV file (I think). When I upload it, it gives me the option to skip duplicates, but will it skip the entire row or just that cell, causing my data to be placed in the wrong rows? It's an Apache server; I don't know what version of MySQL. I'm using 000webhost for this.
Higgs,
In database/ETL terminology, this issue is called a deduplication strategy.
There is no template answer for this, but I suggest these helpful readings:
Academic paper - Joint Deduplication of Multiple Record Types in Relational Data
Deduplication article
Some open source tools:
Duke tool
Data cleaner
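And if you end up scripting the import rather than using the web interface, MySQL's LOAD DATA answers your duplicate-key question directly: IGNORE skips the entire incoming row when its key already exists (never individual cells), while REPLACE overwrites the old row with the new one. A minimal sketch with hypothetical table and file names:

-- skip incoming rows whose unique ID already exists
LOAD DATA LOCAL INFILE 'clients.csv'
IGNORE INTO TABLE clients
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the CSV header row

-- or keep the newest data: overwrite existing rows with incoming ones
LOAD DATA LOCAL INFILE 'clients.csv'
REPLACE INTO TABLE clients
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;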
There's a little checkbox near the bottom when you click on import that says 'ignore duplicates' or something like that. Simpler than I thought.