Finding the histogram size for a given table in a PostgreSQL database

I have a PostgreSQL table named census. I have performed the ANALYSE command on the table, and the statistics are recorded in pg_stats.
As expected, pg_stats also contains entries for other tables in the database.
However, I want to know the space consumed for storing the histogram_bounds of the census table alone. Is there a good and fast way to do this?
PS: I have tried dumping the relevant pg_stats rows to disk to measure the space, using
select * into copy_table from pg_stats where tablename='census';
However, it failed because of the pseudo-type anyarray.
Any ideas there too?

In the following I use the table pg_type and its column typname for demonstration purposes. Replace these with your table and column name to get the answer for your case (you didn't say which column you are interested in).
You can use the pg_column_size function to get the size of any column:
SELECT pg_column_size(histogram_bounds)
FROM pg_stats
WHERE schemaname = 'pg_catalog'
AND tablename = 'pg_type'
AND attname = 'typname';
pg_column_size
----------------
1269
(1 row)
To convert an anyarray to a regular array, you can first cast it to text and then to the desired array type:
SELECT histogram_bounds::text::name[] FROM pg_stats ...
If you measure the size of that converted array, you'll notice that it is much bigger than the result above.
The reason is that pg_column_size measures the actual size on disk, and histogram_bounds is big enough to be stored out of line in the TOAST table, where it will be compressed. The converted array is not compressed, because it is not stored in a table.
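To see this for yourself, you can measure both sizes in one query (same pg_type/typname example as above; the exact numbers will vary by installation):
SELECT pg_column_size(histogram_bounds) AS stored_size,
       pg_column_size(histogram_bounds::text::name[]) AS converted_size
FROM pg_stats
WHERE schemaname = 'pg_catalog'
  AND tablename = 'pg_type'
  AND attname = 'typname';
The first column reflects the compressed TOASTed datum, the second the uncompressed in-memory array built by the cast.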

Related

Best way to search for a list of 10,000 words in long text in T-SQL? Cursor is too slow and fails with a memory error

I have a table of 10,000 words (rows) of type NVARCHAR(20) that I need to search for in another table (mymessagetable) containing long text (i.e. NVARCHAR(MAX)) in T-SQL (SQL Server 2012). The size of mymessagetable can be between 30k and 100k rows. I need to know the best way to search for these keywords in the long messages in mymessagetable.
Right now I am using a cursor, and it works well with a small table of around 4,000 words. But once I use the table with 10k rows, it gives a memory error after a few thousand searches, even though the result returns zero rows. In the future I might have a bigger list, around 100k, and I won't always get zero rows back. Is there a better way to get this job done? I was thinking of adding a WHERE clause to break the task into chunks, but that would be manual work.
The SQL command I am using in the cursor (that is, in a loop over my word list of 10k):
SELECT message from mymessagetable where message like '%mytext%'
Why I am doing this:
I am trying to check whether these messages from mymessagetable have anything to do with my list of keys.
Using a cursor should be the first sign that something is wrong!
You can do this set-based:
SELECT mymessagetable.message
     , mywords.word
FROM mymessagetable
INNER JOIN mywords
    ON mymessagetable.message LIKE '%' + mywords.word + '%';
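For a quick, self-contained way to try this, here is a sketch with table variables standing in for the real tables (the sample rows are made up):
-- Minimal sample data to demonstrate the set-based search
DECLARE @mymessagetable table (message nvarchar(max));
DECLARE @mywords table (word nvarchar(20));

INSERT INTO @mywords VALUES (N'timeout'), (N'denied');
INSERT INTO @mymessagetable VALUES
    (N'connection timeout at 10:02'),
    (N'access denied for user x'),
    (N'all good');

SELECT m.message, w.word
FROM @mymessagetable m
INNER JOIN @mywords w
    ON m.message LIKE '%' + w.word + '%';
Note that the leading wildcard still forces a scan of the messages, but this does it in a single set-based pass instead of one round trip per word.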

Issue with datatype Money in SQL SERVER vs string

I have a spreadsheet whose values all get loaded into SQL Server. One of the fields in the spreadsheet happens to be money. In order for everything to be displayed correctly, I added a field in my table with Money as the data type.
When I read the value from the spreadsheet I pretty much store it as a String, such as "94259.4". When it gets inserted into SQL Server it looks like "94259.4000". Is there a way for me to basically get rid of the trailing zeros in the SQL Server value when I grab it from the DB? The issue I'm running into is that even though these two values are the same, because they are both compared as Strings, it thinks they're not the same value.
I'm foreseeing another issue when the value might look like this: 94,259.40. I think what might work is limiting the digits to 2 after the decimal point. As long as I select the value from the server in the format 94,259.40, I think I should be okay.
EDIT:
For Column = 1 To 34
    Select Case Column
        Case 1 'Field 1
            If Not ([String].IsNullOrEmpty(CStr(excel.Cells(Row, Column).Value)) Or CStr(excel.Cells(Row, Column).Value) = "") Then
                strField1 = CStr(excel.Cells(Row, Column).Value)
            End If
        Case 2 'Field 2
            ' and so on
I go through each field and store the value as a string. Then I compare it against the DB and see if there is a record that has the same values. The only field in my way is the Money field.
You can use Format() to compare as strings, or even cast to Float. For example:
Declare @YourTable table (value money)
Insert Into @YourTable values
(94259.4000),
(94259.4500),
(94259.0000)
Select Original = value
      ,AsFloat = cast(value as float)
      ,Formatted = format(value,'0.####')
From @YourTable
Returns
Original AsFloat Formatted
94259.40 94259.4 94259.4
94259.45 94259.45 94259.45
94259.00 94259 94259
I should note that Format() has some great functionality, but it is NOT known for its performance.
The core issue is that string data is being used to represent numeric information, hence the problems comparing "123.400" to "123.4" and getting mismatches. They should mismatch. They're strings.
The solution is to store the data in the spreadsheet in its proper form - numeric, and then select a proper format for the database - which is NOT the "Money" datatype (insert shudders and visions of vultures circling overhead). Otherwise, you are going to have an expanding kluge of conversions between types as you go back and forth between two improperly designed solutions, and finding more and more edge cases that "don't quite work," and require more special cases...and so on.
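As a rough sketch of that approach (the column name, precision, and sample value here are assumptions for illustration): store the value as a decimal and convert the incoming string once, then compare numerically rather than as strings:
Declare @YourTable table (value decimal(18,2))
Insert Into @YourTable values (94259.40)

-- The value read from the spreadsheet arrives as a string
Declare @FromSpreadsheet varchar(20) = '94259.4'

Select value
From @YourTable
Where value = cast(@FromSpreadsheet as decimal(18,2))  -- 94259.4 converts to 94259.40, so this matches
With the comparison done in the numeric domain, trailing zeros and thousands separators stop mattering.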

Storage issue with Key,Value datatypes, particularly Hstores in Postgresql

Say I have a table with 3 columns: varchar(20), hstore, smallint
Now if I insert the following: "ABCDEF", "abc=>123, xyz=>888, lmn=>102", 5
How much space will the record take in PostgreSQL? Is the hstore stored as plain text?
So if I have a million records, the space taken by the keys (abc,xyz,lmn) will be duplicated across all the records?
I'm asking this because I have a use case in which I need to store an unknown number of key/value pairs, with each key taking up to 20 bytes and the value no more than smallint range.
The catch is that the number of records is massive, around 90 million a day, and the number of key/value pairs is ~400. This quickly leads to storage problems, since just a day's data would total around 800GB, with a massive percentage taken up by the keys, which are duplicated across all records.
So considering there are 400 Key/Value pairs, a single Hstore in a record (if stored as plain text) would take 400*22 Bytes. Multiplied by 90 Million, that is 737GB.
If stored in normal columns as 2 Byte ints, it will take just 67GB.
Are HStores not suitable for this use case? Do I have any option which can help me with this storage issue? I know this is a big ask and I might just have to go with a regular columnar storage solution and forgo the flexibility offered by key value.
How much space will the record take in PostgreSQL?
To get the raw uncompressed size:
SELECT pg_column_size( ROW( 'ABCDEF', 'abc=>123, xyz=>888, lmn=>102'::hstore, 5) );
but due to TOAST compression and out-of-line storage that might not be the on-disk size... though it often is:
CREATE TABLE blah(col1 text, col2 hstore, col3 integer);
INSERT INTO blah (col1, col2, col3)
VALUES ('ABCDEF', 'abc=>123, xyz=>888, lmn=>102'::hstore, 5);
regress=> SELECT pg_column_size(blah) FROM blah;
pg_column_size
----------------
84
(1 row)
If you used a bigger hstore value here it might get compressed and stored out of line. In that case, the size would depend on how compressible it is.
Is the hstore stored as plain text?
No, it's a binary format, but it is not compressed either; the keys/values themselves are plain text.
So if I have a million records, the space taken by the keys (abc,xyz,lmn) will be duplicated across all the records?
Correct. Each hstore value is a standalone value. It has no relationship with any other value anywhere in the system. It's just like a text or json or whatever else. There's no sort of central key index or anything like that.
Demo:
CREATE TABLE hsdemo(hs hstore);
INSERT INTO hsdemo(hs)
SELECT hstore(ARRAY['thisisthefirstkey', 'thisisanotherbigkey'], ARRAY[x::text, x::text])
FROM generate_series(1,10000) x;
SELECT pg_size_pretty(pg_relation_size('hsdemo'::regclass));
-- prints 992kb
INSERT INTO hsdemo(hs)
SELECT hstore(ARRAY['thisisthefirstkey', 'thisisanotherbigkey'], ARRAY[x::text, x::text])
FROM generate_series(10000,20000) x;
SELECT pg_size_pretty(pg_relation_size('hsdemo'::regclass));
-- prints 1968kb, i.e. near doubling for double the records.
Thus, if you have many highly duplicated large keys and small values, you should probably look at a normalized schema (yes, even EAV).
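A minimal sketch of what such a normalized/EAV layout might look like (table and column names here are made up for illustration):
-- Key names stored once; each reading stores only a 2-byte key id and a 2-byte value
CREATE TABLE kv_keys (
    key_id smallint PRIMARY KEY,
    name   varchar(20) NOT NULL UNIQUE
);

CREATE TABLE kv_values (
    record_id bigint   NOT NULL,
    key_id    smallint NOT NULL REFERENCES kv_keys,
    value     smallint NOT NULL,
    PRIMARY KEY (record_id, key_id)
);
With ~400 pairs per record, that is ~400 rows in kv_values per record.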
However, be aware that PostgreSQL has quite a large per-row overhead of over 20 bytes, so you may not gain as much as you'd expect by storing huge numbers of short rows instead of something like an hstore.
You can always compromise - keep a lookup table of full key names, and associate it with a short hstore key. So your app essentially compresses the keys in each hstore.
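A sketch of that compromise (again, the names are illustrative): keep the full key names in a lookup table and store only the short surrogate key inside each hstore:
-- Assumes the hstore extension is already installed, as in the question
CREATE TABLE key_names (
    key_id smallint PRIMARY KEY,
    name   varchar(20) NOT NULL UNIQUE
);

INSERT INTO key_names VALUES (1, 'abc'), (2, 'xyz'), (3, 'lmn');

CREATE TABLE records (
    id   bigserial PRIMARY KEY,
    vals hstore          -- e.g. '1=>123, 2=>888, 3=>102'
);

-- Expand the short keys back to the full names when reading
SELECT r.id, k.name, (r.vals -> k.key_id::text) AS value
FROM records r
JOIN key_names k ON r.vals ? k.key_id::text;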

First column's data is missing

I am a newbie to ABAP. I am trying this program with Open SQL, and when I execute the program, the first column's data is always missing. I have looked it up and the syntax appears to be correct. I am using the kna1 table, and the query is pretty simple too. If anybody notices the issue, please help me out.
DATA: WA_TAB_KNA1 TYPE KNA1,
      IT_TAB_KNA1 TYPE TABLE OF KNA1,
      V_KUNNR TYPE KUNNR.

SELECT-OPTIONS: P_KUNNR FOR V_KUNNR.

SELECT name1 kunnr name2
  INTO TABLE IT_TAB_KNA1 FROM KNA1
  WHERE KUNNR IN P_KUNNR.

LOOP AT IT_TAB_KNA1 INTO WA_TAB_KNA1.
  WRITE:/ WA_TAB_KNA1-KUNNR, ' ', WA_TAB_KNA1-NAME1.
ENDLOOP.
This is a classic - I suppose every ABAP developer has to experience this at least once.
You're using an internal table of structure KNA1, which means that your target variable has the following structure
ccckkkkkkkkkklllnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...
with ccc being the client, kkkkkkkkkk being the field KUNNR (10 characters), lll the field LAND1 (3 characters), then 35 ns for the field NAME1, 35 Ns for the field NAME2 and so on.
In your SELECT statement, you tell the system to retrieve the columns NAME1, KUNNR and NAME2 - in that order! This will yield a result set that has the following structure, using the nomenclature above:
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnkkkkkkkkkkNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Instead of raising some kind of type error, the system will then try to squeeze the data into the target structure - mainly for historical reasons. Because the first fields are all character fields, it will succeed. The result: the field MANDT of your internal table contains the first three characters of NAME1, the field KUNNR contains the characters 4-13 of the source field NAME1 and so on.
Fortunately, the solution is easy: use INTO CORRESPONDING FIELDS OF TABLE instead of INTO TABLE. This will cause the system to use a fieldname-based mapping when filling the target table. As tomdemuyt mentioned, it's also possible to roll your own target structure -- and for large data sets, that's a really good idea because you're wasting a lot of memory otherwise. Still, sometimes this is not an option, so you really have to know this error - recognize it as soon as you see it and know what to do.
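Applied to the code above, the corrected SELECT would look something like this:
SELECT name1 kunnr name2
  INTO CORRESPONDING FIELDS OF TABLE IT_TAB_KNA1
  FROM KNA1
  WHERE KUNNR IN P_KUNNR.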

How to search for a string in the whole database?

I have an informix database consisting of a large number of tables.
I know that there is a string "example" somewhere inside some table, but don't know which table it is or which column it is. (I know this is a very rare case)
Because of the large number of tables, there is no way to look for it manually. How do I find this value inside this large database? Is there any query to find it?
Thanks in advance!
Generally, you have two approaches.
One way would be to dump each table to individual text files and grep the files. This can be done manually on a small database, or via a script in the case of a larger one. Look into UNLOAD and dbaccess. Here is an example in a BASH script (you'll need to generate the table list in the script, either statically or via a query):
unload_tables () {
    for table in ${TABLE_LIST}; do
        dbaccess database_name << EOF
unload to "${OUT_PATH}/${table}.out"
select * from $table;
EOF
    done
}
Or, and this is a little more complicated, you can create a specific SELECT (filtering each column for "example") for each table and column in your db in a similar automated fashion using systables and syscolumns, then run each SQL statement.
For example, this query shows you all columns in all tables:
SELECT tabname, colno, colname, coltype, collength
FROM systables a, syscolumns b
WHERE a.tabid = b.tabid
It is easy to adapt this so that the SELECT returns a properly formatted SQL string that allows you to query the database for matches to "example". Sorry, I don't have a full solution at the ready, but if you google for "informix systables syscolumns", you will see lots of ways this information can be used.
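A rough sketch of such a generator (not a full solution; it hard-codes the search string 'example' and glosses over quoting and unusual column types) might look like this:
-- Emit one probe query per character column; run the generated statements afterwards
SELECT 'SELECT FIRST 1 ''' || TRIM(a.tabname) || '.' || TRIM(b.colname) ||
       ''' AS found_in FROM ' || TRIM(a.tabname) ||
       ' WHERE ' || TRIM(b.colname) || ' LIKE ''%example%'';'
FROM systables a, syscolumns b
WHERE a.tabid = b.tabid
  AND a.tabid >= 100
  AND a.tabtype = 'T'
  AND MOD(b.coltype, 256) IN (0, 13, 15, 16, 43);
Each generated statement returns the table.column label if that column contains a match, and nothing otherwise.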
In Informix, determining the columns that contain character data is fiddly, but doable. Fortunately, you're unlikely to be using the esoteric features such as DISTINCT TYPE that would make the search harder. (Alphadogg's suggestion of unload - using dbexport as I note in my comment to his answer - makes sense, especially if the text might appear in a CLOB or TEXT field.)

You need to know that types CHAR, NCHAR, VARCHAR, NVARCHAR, and LVARCHAR have type numbers of 0, 13, 15, 16 and 43. Further, if the column is NOT NULL, 256 is added to the number. Hence, the character data columns in the database are found via this query:
SELECT t.owner, t.tabname, c.colname, t.tabid, c.colno
FROM "informix".systables t, "informix".syscolumns c
WHERE t.tabid = c.tabid
AND c.coltype IN (0, 13, 15, 16, 43, 256, 269, 271, 272, 299)
AND t.tabtype = 'T' -- exclude views, synonyms, etc.
AND t.tabid >= 100 -- exclude the system catalog
ORDER BY t.owner, t.tabname, c.colno;
This generates the list of places to search. You could get fancy and have the SELECT generate a string suitable for resubmission as a query - that's left as an exercise for the reader.
