How to find the total size of a Hive database

I have a database with 10 tables, and their data is stored in several different locations. Some of the tables are managed tables and some are external tables.
Some tables are located in /apps/hive/warehouse/
Some tables are located in /warehouse/hive/managed/
Some tables are located in /warehouse/hive/external/
Is there any way to find the total size of the database without going into each location and checking its size?

Running the query below against the Hive Metastore DB gives the total size occupied by all the tables in Hive. Note: The result is accurate only if every table has up-to-date statistics. [This can be checked in the TABLE_PARAMS table in the Metastore DB, as described below (How it works?.b)]
Steps:
1. Log in to the Hive Metastore DB and select the database that Hive uses (hive1 in the example below).
2. Once done, execute the query below to get the total size of all the tables in Hive, in bytes. The query sums the totalSize statistic recorded for each Hive table.
MariaDB [hive1]> SELECT SUM(PARAM_VALUE) FROM TABLE_PARAMS WHERE PARAM_KEY="totalSize";
+------------------+
| SUM(PARAM_VALUE) |
+------------------+
| 30376289388684 |
+------------------+
1 row in set (0.00 sec)
3. Remember that the result above covers only one replica. With a replication factor of 3, the actual size in HDFS including replication is 30376289388684 x 3 bytes.
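The query in step 2 sums over every table in Hive, whereas the question asks about a single database. A minimal sketch of a per-database total follows, assuming the standard metastore tables DBS (DB_ID, NAME), TBLS and TABLE_PARAMS. It runs against an in-memory SQLite stand-in with made-up rows (the sales database and its tables are hypothetical), so the CAST syntax may need adjusting for MySQL/MariaDB, and the replication factor of 3 is an assumption.

```python
import sqlite3

# In-memory stand-in for the metastore; a real one is typically MySQL/MariaDB/PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, TBL_NAME TEXT, TBL_TYPE TEXT);
CREATE TABLE TABLE_PARAMS (TBL_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
-- Hypothetical sample rows, apart from test12345 which mirrors the answer.
INSERT INTO DBS VALUES (1, 'default'), (2, 'sales');
INSERT INTO TBLS VALUES
  (5783, 1, 'test12345', 'MANAGED_TABLE'),
  (5784, 2, 'orders', 'MANAGED_TABLE'),
  (5785, 2, 'orders_ext', 'EXTERNAL_TABLE');
INSERT INTO TABLE_PARAMS VALUES
  (5783, 'totalSize', '324'),
  (5784, 'totalSize', '1000'),
  (5785, 'totalSize', '2500');
""")

# Sum the totalSize statistic per database, regardless of where each table is stored.
PER_DATABASE_SIZE = """
SELECT d.NAME, SUM(CAST(p.PARAM_VALUE AS INTEGER)) AS total_bytes
FROM DBS d
JOIN TBLS t ON t.DB_ID = d.DB_ID
JOIN TABLE_PARAMS p ON p.TBL_ID = t.TBL_ID
WHERE p.PARAM_KEY = 'totalSize'
GROUP BY d.NAME
ORDER BY d.NAME
"""

REPLICATION_FACTOR = 3  # assumption: adjust to the cluster's actual setting

sizes = {name: total for name, total in conn.execute(PER_DATABASE_SIZE)}
for name, total in sizes.items():
    print(name, total, total * REPLICATION_FACTOR)
```

Because the totals come from table statistics, the same caveat applies: tables with missing or stale statistics are undercounted or miscounted.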
How it works?
a. Selecting a random table in Hive with id 5783 and name - test12345 from the TBLS table in Hive Metastore DB.
MariaDB [hive1]> SELECT * FROM TBLS WHERE TBL_ID=5783;
+--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
| 5783 | 1555060992 | 1 | 0 | hive | 0 | 17249 | test12345 | MANAGED_TABLE | NULL | NULL | NULL |
+--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
1 row in set (0.00 sec)
b. Checking the parameters stored in the Metastore table TABLE_PARAMS for the same Hive table (id 5783). The totalSize record is the size this table occupies in HDFS for a single replica. It can be compared with the hdfs dfs -du -s output in the next point (c).
The param COLUMN_STATS_ACCURATE with the value true indicates that the table's statistics are up to date. You can look for tables where this value is false to find tables in Hive that might have missing or stale statistics.
MariaDB [hive1]> SELECT * FROM TABLE_PARAMS
-> WHERE TBL_ID=5783;
+--------+-----------------------+-------------+
| TBL_ID | PARAM_KEY | PARAM_VALUE |
+--------+-----------------------+-------------+
| 5783 | COLUMN_STATS_ACCURATE | true |
| 5783 | numFiles | 1 |
| 5783 | numRows | 1 |
| 5783 | rawDataSize | 2 |
| 5783 | totalSize | 324 |
| 5783 | transient_lastDdlTime | 1555061027 |
+--------+-----------------------+-------------+
6 rows in set (0.00 sec)
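The check described in (b) can be sketched as a query that lists tables whose COLUMN_STATS_ACCURATE value is absent or not true. This is only an illustration against an in-memory SQLite stand-in; the stale_stats and no_stats tables are hypothetical, and it assumes the plain true/false values shown above (other Hive versions may store this parameter in a different format).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT);
CREATE TABLE TABLE_PARAMS (TBL_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
-- test12345 mirrors the answer; the other two tables are made up.
INSERT INTO TBLS VALUES (5783, 'test12345'), (5790, 'stale_stats'), (5791, 'no_stats');
INSERT INTO TABLE_PARAMS VALUES
  (5783, 'COLUMN_STATS_ACCURATE', 'true'),
  (5783, 'totalSize', '324'),
  (5790, 'COLUMN_STATS_ACCURATE', 'false'),
  (5790, 'totalSize', '10');
""")

# LEFT JOIN keeps tables that have no COLUMN_STATS_ACCURATE row at all.
SUSPECT_TABLES = """
SELECT t.TBL_NAME
FROM TBLS t
LEFT JOIN TABLE_PARAMS p
  ON p.TBL_ID = t.TBL_ID AND p.PARAM_KEY = 'COLUMN_STATS_ACCURATE'
WHERE p.PARAM_VALUE IS NULL OR p.PARAM_VALUE <> 'true'
ORDER BY t.TBL_NAME
"""

suspect_tables = [row[0] for row in conn.execute(SUSPECT_TABLES)]
print(suspect_tables)
```

Any table returned here is a candidate for recomputing statistics before trusting the summed totalSize.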
c. Output of hdfs dfs -du -s for the same table. 324 and 972 are the sizes of one replica and of all three replicas of the table data in HDFS.
324 972 /user/hive/warehouse/test12345
Hope this helps!

Related

BigQuery removes fields with Postgres Array during datastream ingestion

I have this table named student_classes:
| id | name | class_ids |
| ----| ---------| -----------|
| 1 | Rebecca | {1,2,3} |
| 2 | Roy | {1,3,4} |
| 3 | Ted | {2,4,5} |
name is type text / string
class_ids is type integer[]
I created a datastream from PostgreSQL to BigQuery (following these instructions), but when I looked at the table's schema in BigQuery the class_ids field was gone and I am not sure why.
I was expecting class_ids would get ingested into BigQuery instead of getting dropped.

Create/Update table in MS Access dynamically

EDIT:
Here's what I have: An Access database made up of 3 tables linked from SQL server. I need to create a new table in this database by querying the 3 source tables. Here are examples of the 3 tables I'm using:
PlanTable1
+------+------+------+------+---------+---------+
| Key1 | Key2 | Key3 | Key4 | PName | MainKey |
+------+------+------+------+---------+---------+
| 53 | 1 | 5 | -1 | Bikes | 536681 |
| 53 | 99 | -1 | -1 | Drinks | 536682 |
| 53 | 66 | 68 | -1 | Balls | 536683 |
+------+------+------+------+---------+---------+
SpTable
+----+---------+---------+
| ID | MainKey | SpName |
+----+---------+---------+
| 10 | 536681 | Wing1 |
| 11 | 536682 | Wing2 |
| 12 | 536683 | Wing3 |
+----+---------+---------+
LocTable
+-------+-------------+--------------+
| LocID | CenterState | CenterCity |
+-------+-------------+--------------+
| 10 | IN | Indianapolis |
| 11 | OH | Columbus |
| 12 | IL | Chicago |
+-------+-------------+--------------+
You can see the relationships between the tables. The NewMasterTable I need to create based off of these will look something like this:
NewMasterTable
+-------+--------+-------------+------+--------------+-------+-------+-------+
| LocID | PName | CenterState | Key4 | CenterCity | Wing1 | Wing2 | Wing3 |
+-------+--------+-------------+------+--------------+-------+-------+-------+
| 10 | Bikes | IN | -1 | Indianapolis | 1 | 0 | 0 |
| 11 | Drinks | OH | -1 | Columbus | 0 | 1 | 0 |
| 12 | Balls | IL | -1 | Chicago | 0 | 0 | 1 |
+-------+--------+-------------+------+--------------+-------+-------+-------+
The hard part for me is making this new table dynamic. In the future, rows may be added to the source tables. I need my NewMasterTable to reflect any changes/additions to the source. How do I go about building the NewMasterTable as described? Does this make any sort of sense?
Since an Access table is a hard requirement, probably the only way to go about it is to create a set of Update and Insert queries that are executed periodically. Access has no built-in "dynamic" feature that will monitor the sources and update the table.
First, create the table. You could either 1) do this manually from scratch by defining the columns and constraints yourself, or 2) create a make-table query (i.e. SELECT... INTO) that generates most of the schema, then add any additional columns, edit necessary details and add appropriate indexes.
Define and save Update and Insert (and optional Delete) queries to keep the table synced. I'm not sharing actual code here, because that goes beyond your primary issue I think and requires specifics that you need to define. Due to some ambiguity with your key values (the field names and sample data still are not sufficient to reveal precise relationships and constraints), it is likely that you'll need multiple Update statements.
In particular, the "Wing" columns will likely require a TRANSFORM (crosstab) statement.
You may not be able to update all columns appropriately using a single query. I recommend not trying to force such an "artificial" requirement. Multiple queries can actually be easier to understand and maintain.
In the event that you experience "query is not updateable" errors, you may need to define other "temporary" tables with appropriate indexes, into which you do initial inserts from the linked tables, then subsequent queries to update your master table from those.
Finally, and I think this is the key to solving your problem, you need to define some Access form (or other code) that periodically runs your set of "sync" queries. Access forms have a [Timer Interval] property and corresponding Timer event that fires periodically. Add VBA code in the Form_Timer sub that runs all your queries. I would suggest "wrapping" such VBA in a transaction and adding appropriate error handling and error logging, etc.
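To make the shape of such a sync concrete, here is a rough sketch in Python with SQLite standing in for Access. It swaps in a simpler technique than the Update/Insert queries described above: it drops and rebuilds NewMasterTable on every run, which a timer could call periodically. The join keys (SpTable.ID = LocTable.LocID and PlanTable1.MainKey = SpTable.MainKey) are assumptions inferred from the sample data, and rebuild_master is a hypothetical helper name.

```python
import sqlite3

def rebuild_master(conn):
    """Drop and recreate NewMasterTable with one 0/1 column per SpName value."""
    wings = [r[0] for r in conn.execute(
        "SELECT DISTINCT SpName FROM SpTable ORDER BY SpName")]
    columns = ["l.LocID AS LocID", "p.PName AS PName", "l.CenterState AS CenterState",
               "p.Key4 AS Key4", "l.CenterCity AS CenterCity"]
    for wing in wings:
        literal = wing.replace("'", "''")   # escape for a SQL string literal
        ident = wing.replace('"', '""')     # escape for a quoted identifier
        columns.append(
            "MAX(CASE WHEN s.SpName = '%s' THEN 1 ELSE 0 END) AS \"%s\"" % (literal, ident))
    conn.execute("DROP TABLE IF EXISTS NewMasterTable")
    conn.execute(
        "CREATE TABLE NewMasterTable AS SELECT " + ", ".join(columns) +
        " FROM LocTable l"
        " JOIN SpTable s ON s.ID = l.LocID"          # assumed relationship
        " JOIN PlanTable1 p ON p.MainKey = s.MainKey"  # assumed relationship
        " GROUP BY l.LocID, p.PName, l.CenterState, p.Key4, l.CenterCity")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PlanTable1 (Key1, Key2, Key3, Key4, PName, MainKey);
CREATE TABLE SpTable (ID, MainKey, SpName);
CREATE TABLE LocTable (LocID, CenterState, CenterCity);
INSERT INTO PlanTable1 VALUES (53, 1, 5, -1, 'Bikes', 536681),
  (53, 99, -1, -1, 'Drinks', 536682), (53, 66, 68, -1, 'Balls', 536683);
INSERT INTO SpTable VALUES (10, 536681, 'Wing1'), (11, 536682, 'Wing2'), (12, 536683, 'Wing3');
INSERT INTO LocTable VALUES (10, 'IN', 'Indianapolis'), (11, 'OH', 'Columbus'), (12, 'IL', 'Chicago');
""")

rebuild_master(conn)
rows = conn.execute("SELECT * FROM NewMasterTable ORDER BY LocID").fetchall()
for row in rows:
    print(row)
```

A full rebuild is easy to reason about but discards any manual edits to the master table; incremental Update and Insert queries, as recommended above, avoid that at the cost of more queries to maintain.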

Why does Neo4j hit every indexed record when only returning a count?

I am using version 3.0.3, and running my queries in the shell.
I have ~58 million record nodes with 4 properties each, specifically an ID string, an epoch time integer, and lat/lon floats.
When I run a query like profile MATCH (r:record) RETURN count(r); I get a very quick response:
+----------+
| count(r) |
+----------+
| 58430739 |
+----------+
1 row
29 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------------+----------------+------+---------+-----------+--------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------------+----------------+------+---------+-----------+--------------------------------+
| +ProduceResults | 7644 | 1 | 0 | count(r) | count(r) |
| | +----------------+------+---------+-----------+--------------------------------+
| +NodeCountFromCountStore | 7644 | 1 | 0 | count(r) | count( (:record) ) AS count(r) |
+--------------------------+----------------+------+---------+-----------+--------------------------------+
Total database accesses: 0
The Total database accesses: 0 and NodeCountFromCountStore tells me that neo4j uses a counting mechanism here that avoids iterating over all the nodes.
However, when I run profile MATCH (r:record) WHERE r.time < 10000000000 RETURN count(r);, I get a very slow response:
+----------+
| count(r) |
+----------+
| 58430739 |
+----------+
1 row
151278 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+-----------------------+----------------+----------+----------+-----------+------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-----------------------+----------------+----------+----------+-----------+------------------------------+
| +ProduceResults | 1324 | 1 | 0 | count(r) | count(r) |
| | +----------------+----------+----------+-----------+------------------------------+
| +EagerAggregation | 1324 | 1 | 0 | count(r) | |
| | +----------------+----------+----------+-----------+------------------------------+
| +NodeIndexSeekByRange | 1752922 | 58430739 | 58430740 | r | :record(time) < { AUTOINT0} |
+-----------------------+----------------+----------+----------+-----------+------------------------------+
Total database accesses: 58430740
The count is correct, as I chose a time value larger than all of my records. What surprises me here is that Neo4j is accessing EVERY single record. The profiler states that Neo4j is using the NodeIndexSeekByRange as an alternative method here.
My question is, why does Neo4j access EVERY record when all it is returning is a count? Are there no intelligent mechanisms inside the system to count a range of values after seeking the boundary/threshold value within the index?
I use Apache Solr for the same data, and returning a count after searching an index is extremely fast (about 5 seconds). If I recall correctly, both platforms are built on top of Apache Lucene. While I don't know much about that software internally, I would assume that the index support is fairly similar for both Neo4j and Solr.
I am working on a proxy service that will deliver results in a paginated form (using the SKIP n LIMIT m technique) by first getting a count, and then iterating over results in chunks. This works really well for Solr, but I am afraid that Neo4j may not perform well in this scenario.
Any thoughts?
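The latter query does a NodeIndexSeekByRange operation. The index seek finds every entry in the :record(time) index whose value is less than 10000000000 and produces one row per matching node, and EagerAggregation then counts those rows. The count store that answers the first query only holds totals per label, so it cannot answer a count restricted to a property range.
Because your threshold is larger than every time value, all 58,430,739 nodes match, so the query has to visit every one of them, and that is why it is much slower.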
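The difference the question asks about can be illustrated with a toy sorted list standing in for an ordered index. This is not how Neo4j or Lucene is implemented; it only contrasts counting by visiting every matching entry with counting by locating the boundary position. Both function names are hypothetical.

```python
from bisect import bisect_left

def count_by_visiting(index, threshold):
    """Count entries below threshold by walking them one at a time."""
    count = 0
    visited = 0
    for value in index:
        if value >= threshold:
            break
        visited += 1
        count += 1
    return count, visited

def count_by_boundary(index, threshold):
    """Count entries below threshold by binary-searching for the boundary."""
    return bisect_left(index, threshold)

times = [5, 10, 15, 20]  # toy stand-in for a sorted index on time

print(count_by_visiting(times, 100))  # every entry is visited when all match
print(count_by_boundary(times, 100))  # same count without walking the entries
```

The first function does work proportional to the number of matches, which mirrors the DB hits in the slow profile; the second needs only a logarithmic number of comparisons, but it requires an index structure that exposes positions or maintains counts.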

MySQL Import into Innodb table severely spikes at a certain point

I'm trying to migrate a 30GB database from one server to another.
The short story is that at a certain point through the process, the amount of time it takes to import records severely increases as a spike. The following is from using the SOURCE command to import a chunk of 500k records (out of about ~25-30 million throughout the database) that was exported as an sql file that was ssh tunnelled over to the new server:
...
Query OK, 2871 rows affected (0.73 sec)
Records: 2871 Duplicates: 0 Warnings: 0
Query OK, 2870 rows affected (0.98 sec)
Records: 2870 Duplicates: 0 Warnings: 0
Query OK, 2865 rows affected (0.80 sec)
Records: 2865 Duplicates: 0 Warnings: 0
Query OK, 2871 rows affected (0.87 sec)
Records: 2871 Duplicates: 0 Warnings: 0
Query OK, 2864 rows affected (2.60 sec)
Records: 2864 Duplicates: 0 Warnings: 0
Query OK, 2866 rows affected (7.53 sec)
Records: 2866 Duplicates: 0 Warnings: 0
Query OK, 2879 rows affected (8.70 sec)
Records: 2879 Duplicates: 0 Warnings: 0
Query OK, 2864 rows affected (7.53 sec)
Records: 2864 Duplicates: 0 Warnings: 0
Query OK, 2873 rows affected (10.06 sec)
Records: 2873 Duplicates: 0 Warnings: 0
...
The spikes eventually average 16-18 seconds per ~2800 rows affected. Granted, I don't usually use SOURCE for a large import, but for the sake of showing real output, I used it to see when the spikes happen. Using the mysql command or mysqlimport yields the same results. Even piping the dump directly into the new database instead of going through an sql file produces these spikes.
As far as I can tell, this happens after a certain number of records have been inserted into a table. The first time I boot up a server and import a chunk of that size, it goes through just fine, give or take the estimated amount it handles before these spikes occur. I can't pin that number down because I haven't reproduced the issue consistently enough to draw a conclusion. There are ~20 tables with fewer than 500,000 records each, and they all imported just fine through a single command. This seems to happen only to tables that have a very large amount of data.
Granted, the solutions I've come across so far seem to address only the natural DR that occurs when you import over time (the expected outcome in my case was that by the end of importing 500k records it would take 2-3 seconds per ~2800 rows, whereas those questions were about it not taking that long at the end).
This comes from a single SugarCRM table called 'campaign_log', which has ~9 million records. I was able to import it in chunks of 500k back onto the old server I'm migrating off of without these spikes occurring, so I assume this has to do with my new server's configuration.
Another thing: whenever these spikes occur, the table being imported into shows its record count in an odd way. I know InnoDB gives count estimates, but the number is not preceded by the ~ that indicates an estimate. It is usually accurate, and the displayed amount doesn't change each time you refresh the table. (This is based on what phpMyAdmin reports.)
Here's the following commands/InnoDB system variables I have on the new server:
INNODB System Vars:
+---------------------------------+------------------------+
| Variable_name | Value |
+---------------------------------+------------------------+
| have_innodb | YES |
| ignore_builtin_innodb | OFF |
| innodb_adaptive_flushing | ON |
| innodb_adaptive_hash_index | ON |
| innodb_additional_mem_pool_size | 8388608 |
| innodb_autoextend_increment | 8 |
| innodb_autoinc_lock_mode | 1 |
| innodb_buffer_pool_instances | 1 |
| innodb_buffer_pool_size | 8589934592 |
| innodb_change_buffering | all |
| innodb_checksums | ON |
| innodb_commit_concurrency | 0 |
| innodb_concurrency_tickets | 500 |
| innodb_data_file_path | ibdata1:10M:autoextend |
| innodb_data_home_dir | |
| innodb_doublewrite | ON |
| innodb_fast_shutdown | 1 |
| innodb_file_format | Antelope |
| innodb_file_format_check | ON |
| innodb_file_format_max | Antelope |
| innodb_file_per_table | OFF |
| innodb_flush_log_at_trx_commit | 1 |
| innodb_flush_method | fsync |
| innodb_force_load_corrupted | OFF |
| innodb_force_recovery | 0 |
| innodb_io_capacity | 200 |
| innodb_large_prefix | OFF |
| innodb_lock_wait_timeout | 50 |
| innodb_locks_unsafe_for_binlog | OFF |
| innodb_log_buffer_size | 8388608 |
| innodb_log_file_size | 5242880 |
| innodb_log_files_in_group | 2 |
| innodb_log_group_home_dir | ./ |
| innodb_max_dirty_pages_pct | 75 |
| innodb_max_purge_lag | 0 |
| innodb_mirrored_log_groups | 1 |
| innodb_old_blocks_pct | 37 |
| innodb_old_blocks_time | 0 |
| innodb_open_files | 300 |
| innodb_print_all_deadlocks | OFF |
| innodb_purge_batch_size | 20 |
| innodb_purge_threads | 1 |
| innodb_random_read_ahead | OFF |
| innodb_read_ahead_threshold | 56 |
| innodb_read_io_threads | 8 |
| innodb_replication_delay | 0 |
| innodb_rollback_on_timeout | OFF |
| innodb_rollback_segments | 128 |
| innodb_spin_wait_delay | 6 |
| innodb_stats_method | nulls_equal |
| innodb_stats_on_metadata | ON |
| innodb_stats_sample_pages | 8 |
| innodb_strict_mode | OFF |
| innodb_support_xa | ON |
| innodb_sync_spin_loops | 30 |
| innodb_table_locks | ON |
| innodb_thread_concurrency | 0 |
| innodb_thread_sleep_delay | 10000 |
| innodb_use_native_aio | ON |
| innodb_use_sys_malloc | ON |
| innodb_version | 5.5.39 |
| innodb_write_io_threads | 8 |
+---------------------------------+------------------------+
System Specs:
Intel Xeon E5-2680 v2 (Ivy Bridge) 8 Processors
15GB Ram
2x80 SSDs
CMD to Export:
mysqldump -u <olduser> <oldpw>, <olddb> <table> --verbose --disable-keys --opt | ssh -i <privatekey> <newserver> "cat > <nameoffile>"
Thank you for any assistance. Let me know if there's any other information I can provide.
I figured it out. I increased innodb_log_file_size from 5MB to 1024MB. This significantly increased the import speed (it never went above 1 second per 3000 rows) and also fixed the spikes. There were only 2 spikes across all the records I imported, and after each one the import immediately went back to taking under 1 second.
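For a sense of scale, the total redo log space is innodb_log_file_size multiplied by innodb_log_files_in_group. The small calculation below uses the values from the variable listing above and the new 1024MB setting. A likely explanation for the spikes, offered as an assumption rather than a diagnosis, is that the tiny log filled up during the bulk import and forced InnoDB to flush dirty pages before it could continue. The function name is hypothetical.

```python
def redo_log_capacity(log_file_size, files_in_group):
    """Total redo log space in bytes for the given InnoDB settings."""
    return log_file_size * files_in_group

# Values from the listing: innodb_log_file_size=5242880, innodb_log_files_in_group=2.
before = redo_log_capacity(5242880, 2)
# After the change: innodb_log_file_size raised to 1024MB, same number of files.
after = redo_log_capacity(1024 * 1024 * 1024, 2)

print(before // (1024 * 1024), "MiB before")
print(after // (1024 * 1024), "MiB after")
```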

Why is this query returning unwanted results?

Good morning,
I have a problem with this query:
SELECT
P.txt_nome AS Pergunta,
IP.nome AS Resposta,
COUNT(*) AS Qtd
FROM
tb_resposta_formulario RF
INNER JOIN formularios F ON
F.id_formulario = RF.id_formulario
INNER JOIN tb_pergunta P ON
P.id_pergunta = RF.id_pergunta
INNER JOIN tb_resposta_formulario_combo RFC ON
RFC.id_resposta_formulario = RF.id_resposta_formulario
INNER JOIN itens_perguntas IP ON
IP.id_item_pergunta = RFC.id_item_pergunta
WHERE
RF.id_formulario = 2
GROUP BY
P.txt_nome,
IP.nome
This is the actual result of this query:
|Pergunta| Resposta |Qtd|
|Produto |Combo 1MB | 3 |
|Produto |Combo 2MB | 5 |
|Produto |Combo 4MB | 1 |
|Produto |Combo 6MB | 1 |
|Produto |Combo 8MB | 4 |
|Região |MG | 3 |
|Região |PR | 2 |
|Região |RJ | 3 |
|Região |SC | 1 |
|Região |SP | 5 |
These are the results I was expecting:
|Produto | Região |Qtd|
|Combo 1MB | MG | 3 |
|Combo 2MB | SP | 5 |
|Combo 4MB | SC | 1 |
|Combo 6MB | RJ | 1 |
|Combo 8MB | PR | 2 |
I am using the PIVOT and UNPIVOT operators but the result is not satisfactory.
Has anyone already faced this situation before? Do you have any insight you can offer?
I already analyzed these links:
SQL Server 2005 Pivot on Unknown Number of Columns
Transpose a set of rows as columns in SQL Server 2000
SQL Server 2005, turn columns into rows
Pivot Table and Concatenate Columns
PIVOT in sql 2005
Att,
Pelegrini
The "obvious" answer is: because the query is incorrect. We really know nothing about the table structure and what you're trying to achieve.
Concerning at least one very basic problem in your query: you're expecting the columns |Produto | Região |Qtd| in your result, yet the query explicitly selects the columns Pergunta, Resposta and Qtd, which is exactly the result you're getting.
How well are you acquainted with SQL at all? It may be worth it to read an introductory text. I'd suggest this as a good introduction. (Uses Oracle, but the principles are the same)
