What would be the best way to store a large amount of text in a database? I would expect about 2,500 words, and since the average English word is around 6 characters, I expect over 15,000 characters. This text may be non-English, so I guess I would need Unicode to support everything.
This text needs to be inserted, retrieved, and also searched by keywords.
Maxwell.
You should use NVARCHAR(MAX) as the datatype for the column in question. I would also suggest creating a FULLTEXT INDEX on that column, since you said it will also be searched by keywords.
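For reference, a minimal sketch of that setup (the table, catalog, and column names here are made up, and this assumes the Full-Text Search feature is installed):
CREATE TABLE dbo.Documents
(
    Id      INT IDENTITY(1,1) CONSTRAINT PK_Documents PRIMARY KEY,
    Content NVARCHAR(MAX) NOT NULL
);
-- A full-text index needs a full-text catalog and a unique, single-column key index
CREATE FULLTEXT CATALOG DocumentsCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Documents (Content) KEY INDEX PK_Documents;
-- Keyword searches then use CONTAINS (or FREETEXT) rather than LIKE
SELECT Id, Content
FROM dbo.Documents
WHERE CONTAINS(Content, N'keyword');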
You need to use NVARCHAR(MAX).
It can store up to 2^31-1 bytes (2 GB), which works out to 1,073,741,823 two-byte characters.
If you are going to insert non-English characters you have to use NVARCHAR, and when inserting the data you have to prefix the string literal with an N, like this:
CREATE TABLE tmp( description NVARCHAR(MAX) )
INSERT INTO tmp VALUES (N'Добро...')
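As a quick illustration of why the N prefix matters (assuming the database's default collation uses a code page that does not cover Cyrillic):
-- Without the N prefix the literal is treated as varchar in the database's code page,
-- so characters outside that code page are lost before they reach the NVARCHAR column
INSERT INTO tmp VALUES ('Добро')    -- typically stored as '?????'
INSERT INTO tmp VALUES (N'Добро')   -- stored correctly
SELECT description FROM tmp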
In a SQL Server database, I have a table Messages with the following columns:
Id INT IDENTITY(1,1)
Detail VARCHAR(5000)
DatetimeEntered DATETIME
PersonEntered VARCHAR(25)
Messages are pretty basic, and only allow alphanumeric characters and a handful of special characters, which are as follows:
`¬!"£$%^&*()-_=+[{]};:'##~\|,<.>/?
Ignoring the bulk of the special characters bar the apostrophe, what I need is a way to list each word along with how many times the word occurs in the Detail column, which I can then filter by PersonEntered and DatetimeEntered.
Example output:
Word Frequency
-----------------
a 11280
the 10102
and 8845
when 2024
don't 2013
.
.
.
It doesn't need to be particularly clever. It is perfectly fine if dont and don't are treated as separate words.
I'm having trouble splitting out the words into a temporary table called #Words.
Once I have a temporary table, I would apply the following query:
SELECT
Word,
COUNT(*) AS WordCount
FROM #Words
GROUP BY Word
ORDER BY COUNT(*) DESC
Please help.
Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only ' is going to appear in a word; anything else is going to be grammatical.
You haven't posted what version of SQL Server you're using, so I'm going to use SQL Server 2017 syntax. If you don't have the latest version, you'll need to replace TRANSLATE with a nested REPLACE (so REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, '¬',' '),...),'/',' '),'?',' ')) and find a string splitter (for example, Jeff Moden's DelimitedSplit8K); a rough sketch of that variant follows the main example below.
USE Sandbox;
GO
CREATE TABLE [Messages] (Detail varchar(5000));
INSERT INTO [Messages]
VALUES ('Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only `''` is going to appear in a word; anything else is going to be grammatical. You haven''t posted what version of SQL you''re using, so I''ve going to use SQL Server 2017 syntax. If you don''t have the latest version, you''ll need to replace `TRANSLATE` with a nested `REPLACE` (So `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, ''¬'','' ''),...),''/'','' ''),''?'','' '')`, and find a string splitter (for example, Jeff Moden''s [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).'),
('As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you''ll get some odd results and it''ll include numbers in there. Things like dates are going to get split out,, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.')
GO
WITH Replacements AS(
SELECT TRANSLATE(Detail, '`¬!"£$%^&*()-_=+[{]};:##~\|,<.>/?', SPACE(33)) AS StrippedDetail -- the replacement string must be the same length (33 spaces) as the character list
FROM [Messages] M)
SELECT SS.[value], COUNT(*) AS WordCount
FROM Replacements R
CROSS APPLY string_split(R.StrippedDetail,' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;
GO
DROP TABLE [Messages];
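For completeness, here is a rough sketch of the pre-2017 variant mentioned above. It assumes you have installed Jeff Moden's DelimitedSplit8K function, and only a few of the nested REPLACE calls are shown; you would need one per special character:
WITH Replacements AS(
SELECT REPLACE(REPLACE(REPLACE(M.Detail, '¬',' '), '/',' '), '?',' ') AS StrippedDetail
    -- ...repeat REPLACE once per remaining special character...
FROM [Messages] M)
SELECT DS.Item AS Word, COUNT(*) AS WordCount
FROM Replacements R
CROSS APPLY dbo.DelimitedSplit8K(R.StrippedDetail,' ') DS
WHERE LEN(DS.Item) > 0
GROUP BY DS.Item
ORDER BY WordCount DESC;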
As a note, this is going to perform AWFULLY. SQL Server is not designed for this type of work. I also imagine you'll get some odd results and it'll include numbers in there. Things like dates are going to get split out, numbers like 9,000,000 would be treated as the words 9 and 000, and hyperlinks will be separated.
I'm making a Windows application using Entity Framework. I have a table named "DichVu".
This is my table "DichVu":
My problem is that when I insert a new record into the "DichVu" table, the primary key has just 5 characters and the insert succeeds. But when I then write a method to get all the records in "DichVu" and show them on a GridControl, my ID field (here MaDV) has 10 characters (5 of them blank).
I tried a query in SQL Server to count the length of "MaDV", and it showed exactly 5 characters for each record.
Here is the result of the query:
And this is the result when I use the method to get those records:
As you can see in the picture above, the issue is with the "MaDV" field.
I hope someone can help me; I would be grateful for any help.
Use VARCHAR(10).
CHAR(10) will always be padded to 10 characters in storage; VARCHAR (variable-length character data) stores only the characters you actually insert, so different lengths are allowed.
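A minimal illustration of the difference (purely hypothetical values):
DECLARE @fixed    CHAR(10)    = 'DV001',
        @variable VARCHAR(10) = 'DV001';
SELECT DATALENGTH(@fixed)    AS FixedBytes,     -- 10: padded with trailing spaces
       DATALENGTH(@variable) AS VariableBytes;  -- 5: only the characters inserted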
In SQL Server 2012 I have a table with an nvarchar column with the collation Latin1_General_100_CI_AS_SC, which is supposed to support Unicode surrogate pair characters, or supplementary characters.
When I run this query:
select KeyValue from terms where KeyValue = N'➰'
(The character above is the curly loop character, code point 10160 / U+27B0, which I assumed is a supplementary character.)
The result is hundreds of different looking single character entries, even though they all have different UTF-16 codepoints. Is this due to collation? Why isn't there an exact match?
EDIT: I now think this is due to collation. There seems to be a group of "undefined" characters in the UTF-16 range, more than 1733 characters, and they are treated as the same by this collation. Although, characters with codes above 65535 are treated as unique and those queries return exact matches.
The two queries below have different results:
select KeyValue from terms where KeyValue = N'π'
returns 3 rows: π and ℼ and ᴨ
select KeyValue from terms where KeyValue LIKE N'π'
returns 2 rows: π and ℼ
Why is this?
This is the weirdest of all. This query:
select KeyValue from terms where KeyValue like N'➰%'
returns ALMOST ALL records in the table, which contains many multi-character, plain Latin-alphabet terms like "8w" or "apple". 90% of the records not being returned start with "æ". What is happening?
NOTE: Just to give this a bit of context, these are all Wikipedia article titles, not random strings.
SQL Server (and thus tempdb) also has its own collation, and it may not be the same as a database's or a column's collation. While character literals should be assigned the default collation of the column or database, the above (perhaps overly simplified) T-SQL examples could be misstating or not revealing the true problem. For example, an ORDER BY clause could have been omitted for the sake of simplicity. Are the expected results returned when the above statements explicitly use the COLLATE clause (COLLATE Latin1_General_100_CI_AS_SC)? See https://msdn.microsoft.com/en-us/library/ms184391.aspx.
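A minimal sketch of that check, reusing the question's table and column names:
SELECT KeyValue
FROM terms
WHERE KeyValue = N'➰' COLLATE Latin1_General_100_CI_AS_SC;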
I have a table with an nvarchar column with collation Latin1_General_100_CI_AS_SC, which is supposed to support Unicode surrogate pair characters, or supplementary characters.
The Supplementary Character-Aware (SCA) collations — those ending with _SC or with _140_ in their names — do support supplementary characters. BUT, "support" only means that the built-in functions handle the surrogate pair as a single, supplementary code point instead of as a pair of surrogate code points. Support for sorting and comparison of supplementary characters actually started in SQL Server 2005 with the introduction of the version 90 collations.
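A minimal illustration of that difference, using a genuine supplementary character (U+1F600, which is encoded as a surrogate pair):
SELECT LEN(N'😀' COLLATE Latin1_General_100_CI_AS)    AS NonSC_Length, -- 2: counted as two surrogate code points
       LEN(N'😀' COLLATE Latin1_General_100_CI_AS_SC) AS SC_Length;    -- 1: counted as one supplementary code point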
even though they all have different UTF-16 codepoints. Is this due to collation? Why isn't there an exact match?
UTF-16 doesn't have code points, it is an encoding that encodes all Unicode code points.
Yes, this behavior is due to collation.
There is no exact match because (as you guessed), code point U+27B0 has no defined sort weight. Hence it is completely ignored and equates to an empty string or any other code point that has no sort weight.
There seems to be a group of "undefined" characters in the UTF-16 range, more than 1733 characters, and they are treated as the same by this collation.
Correct, though some only have a sort weight due to the accent sensitivity of the collation. You would get even more matches if you used Latin1_General_100_CI_AI_SC. And, to be clear, the UTF-16 "range" is all 1,114,112 Unicode code points.
The two queries below have different results ... Why is this?
I can't (yet!) explain why = vs LIKE returns different sets of matches, but there is 1 more character that equates to the 3 that you currently have:
SELECT KeyValue, CONVERT(VARBINARY(40), [KeyValue])
FROM (VALUES (N'π' COLLATE Latin1_General_100_CI_AS_SC), (N'ℼ'), (N'ᴨ'),
(N'Π')) t([KeyValue])
WHERE KeyValue = N'π'; -- 4 rows
SELECT KeyValue, CONVERT(VARBINARY(40), [KeyValue])
FROM (VALUES (N'π' COLLATE Latin1_General_100_CI_AS_SC), (N'ℼ'), (N'ᴨ'),
(N'Π')) t([KeyValue])
WHERE KeyValue LIKE N'π'; -- 3 rows
This is the weirdest of all. This query: ... returns ALMOST ALL records in the table
SELECT 1 WHERE NCHAR(0x27B0) = NCHAR(0x0000) COLLATE Latin1_General_100_CI_AS_SC;
-- 1
SELECT 2 WHERE NCHAR(0x27B0) = N'' COLLATE Latin1_General_100_CI_AS_SC;
-- 2
SELECT 3 WHERE NCHAR(0x27B0) = NCHAR(0x27B0) + NCHAR(0x27B0) + NCHAR(0x27B0)
COLLATE Latin1_General_100_CI_AS_SC;
-- 3
SELECT 4 WHERE N'➰' = N'➰ ➰ ➰ ➰' COLLATE Latin1_General_100_CI_AS_SC;
-- 4
SELECT 5 WHERE N'➰' LIKE N'➰ ➰ ➰ ➰' COLLATE Latin1_General_100_CI_AS_SC;
-- NO ROWS RETURNED!! (spaces matter here due to LIKE)
SELECT 6 WHERE N'➰' LIKE N'➰➰➰➰➰➰' COLLATE Latin1_General_100_CI_AS_SC;
-- 6
This, again, has something to do with the fact that "➰" has no sort weight defined. Of course, neither do æ, Þ, ß, LJ, etc.
I will update this answer once I figure out exactly what LIKE is doing differently than =.
For more info, please see:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide
Collations Info
Say I have a table with 3 columns: varchar(20), hstore, smallint
Now if I insert the following: "ABCDEF", "abc=>123, xyz=>888, lmn=>102", 5
How much space will the record take in PostgreSQL? Is the hstore stored as plain text?
So if I have a million records, the space taken by the keys (abc,xyz,lmn) will be duplicated across all the records?
I'm asking this because I have a use case in which I need to store an unknown number of key/value pairs, with each key taking up to 20 bytes and each value no more than smallint range.
The catch is that the number of records is massive, around 90 million a day, and there are ~400 key/value pairs per record. This quickly leads to storage problems, since just a day's data would total around 800 GB, with a massive percentage taken up by the keys, which are duplicated across all records.
So, considering there are 400 key/value pairs, a single hstore in a record (if stored as plain text) would take 400 * 22 bytes. Multiplied by 90 million records, that is about 737 GB.
If stored in normal columns as 2-byte ints, it would take just 67 GB.
Are hstores not suitable for this use case? Is there any option that can help with this storage issue? I know this is a big ask and I might just have to go with a regular columnar storage solution and forgo the flexibility offered by key/value storage.
How much space will the record take in PostgreSQL?
To get the raw uncompressed size:
SELECT pg_column_size( ROW( 'ABCDEF', 'abc=>123, xyz=>888, lmn=>102'::hstore, 5) );
but due to TOAST (compressed, out-of-line storage) that might not be the on-disk size... though it often is:
CREATE TABLE blah(col1 text, col2 hstore, col3 integer);
INSERT INTO blah (col1, col2, col3)
VALUES ('ABCDEF', 'abc=>123, xyz=>888, lmn=>102'::hstore, 5);
regress=> SELECT pg_column_size(blah) FROM blah;
pg_column_size
----------------
84
(1 row)
If you used a bigger hstore value here it might get compressed and stored out of line. In that case, the size would depend on how compressible it is.
Is the hstore stored as plain text?
No, it's a binary format, but it isn't compressed either; the keys/values are stored as plain text.
So if I have a million records, the space taken by the keys (abc,xyz,lmn) will be duplicated across all the records?
Correct. Each hstore value is a standalone value. It has no relationship with any other value anywhere in the system. It's just like a text or json or whatever else. There's no sort of central key index or anything like that.
Demo:
CREATE TABLE hsdemo(hs hstore);
INSERT INTO hsdemo(hs)
SELECT hstore(ARRAY['thisisthefirstkey', 'thisisanotherbigkey'], ARRAY[x::text, x::text])
FROM generate_series(1,10000) x;
SELECT pg_size_pretty(pg_relation_size('hsdemo'::regclass));
-- prints 992kb
INSERT INTO hsdemo(hs)
SELECT hstore(ARRAY['thisisthefirstkey', 'thisisanotherbigkey'], ARRAY[x::text, x::text])
FROM generate_series(10000,20000) x;
SELECT pg_size_pretty(pg_relation_size('hsdemo'::regclass));
-- prints 1968kb, i.e. near doubling for double the records.
Thus, if you have many highly duplicated large keys and small values, you should probably look at a normalized schema (yes, even EAV).
However, be aware that PostgreSQL has quite a large per-row overhead of over 20 bytes per row, so you may not gain as much as you'd expect by storing huge numbers of short rows instead of something like a hstore.
You can always compromise - keep a lookup table of full key names, and associate it with a short hstore key. So your app essentially compresses the keys in each hstore.
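A minimal sketch of that compromise, with made-up table and column names:
-- Long key names live once in a lookup table...
CREATE TABLE key_lookup (
    key_id   smallint PRIMARY KEY,
    key_name varchar(20) NOT NULL UNIQUE
);
INSERT INTO key_lookup (key_id, key_name)
VALUES (1, 'thisisthefirstkey'), (2, 'thisisanotherbigkey');
-- ...and the hstore stores only the short ids as keys
CREATE TABLE measurements (
    id bigserial PRIMARY KEY,
    kv hstore
);
INSERT INTO measurements (kv)
VALUES (hstore(ARRAY['1','2'], ARRAY['123','888']));
-- The application (or a view) joins back to key_lookup when it needs the full key names.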
I'm facing a strange issue trying to move from SQL Server to Oracle.
In one of my tables I have a column defined as NVARCHAR(255).
After reading a bit I understood that SQL Server counts characters while Oracle counts bytes.
So I defined my column in Oracle as VARCHAR(510), since 255 * 2 = 510.
But when using sqlldr to load the data from a tab-delimited text file I get an error indicating some entries exceeded the length of this column.
After checking in SQL Server using:
SELECT MAX(DATALENGTH(column))
FROM table
I get that the max data length is 510.
I do use the Hebrew_CI_AS collation, even though I don't think it changes anything...
I also checked in SQL Server whether any of the entries contain a TAB, but no... so I guess it's not corrupted data...
Anyone have an idea?
EDIT
After further checking I've noticed that the issue is due to the data file (in addition to the issue solved by Justin Cave's post).
I have changed the row delimiter to '^', since none of my data contains this character, and the column delimiter to '|^|'.
I created a control file as follows:
load data
infile data.txt "str '^'"
badfile "data_BAD.txt"
discardfile "data_DSC.txt"
into table table
FIELDS TERMINATED BY '|^|' TRAILING NULLCOLS
(
col1,
col2,
col3,
col4,
col5,
col6
)
The problem is that my data contains <CR> characters, and sqlldr expects a stream file, so it fails on the <CR>. I do not want to change the data, since it is textual data (error messages, for example).
What is your database character set? You can check with:
SELECT parameter, value
FROM v$nls_parameters
WHERE parameter LIKE '%CHARACTERSET'
Assuming that your database character set is AL32UTF8, each character could require up to 4 bytes of storage (though almost every useful character can be represented with at most 3 bytes of storage). So you could declare your column as VARCHAR2(1020) to ensure that you have enough space.
You could also simply use character length semantics. If you declare your column VARCHAR2(255 CHAR), you'll allocate space for 255 characters regardless of the amount of space that requires. If you change the NLS_LENGTH_SEMANTICS initialization parameter from the default BYTE to CHAR, you'll change the default so that VARCHAR2(255) is interpreted as VARCHAR2(255 CHAR) rather than VARCHAR2(255 BYTE). Note that the 4000-byte limit on a VARCHAR2 remains even if you are using character length semantics.
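For example (made-up table name):
-- Explicit character semantics for one column
CREATE TABLE my_messages (
    description VARCHAR2(255 CHAR)  -- room for 255 characters, however many bytes they need
);
-- Or change the default for the current session before creating tables
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;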
If your data contains line breaks, do you need the TRAILING NULLCOLS parameter? That implies that sometimes columns may be omitted from the end of a logical row. If you combine columns that may be omitted with columns that contain line breaks and data that is not enclosed by at least an optional enclosure character, it's not obvious to me how you would begin to identify where a logical row ended and where it began. If you don't actually need the TRAILING NULLCOLS parameter, you should be able to use the CONTINUEIF parameter to combine multiple physical rows into a single logical row. If you can change the data file format, I'd strongly suggest adding an optional enclosure character.
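As a sketch, the only change to the control file above would be on the FIELDS line, where the optional enclosure character is declared (the data file would then need the affected fields wrapped in double quotes):
FIELDS TERMINATED BY '|^|' OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS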
The bytes used by an NVARCHAR field are equal to two times the number of characters plus two (see http://msdn.microsoft.com/en-us/library/ms186939.aspx), so if you make your VARCHAR field 512 you may be OK. There's also some indication that some character sets use 4 bytes per character, but I've found no indication that Hebrew is one of those character sets.
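A quick way to sanity-check the character and byte counts on the SQL Server side (with a made-up Hebrew value):
DECLARE @s NVARCHAR(255) = N'שלום';
SELECT LEN(@s)        AS Characters,  -- 4
       DATALENGTH(@s) AS Bytes;       -- 8: two bytes per character (the +2 in the formula is storage overhead that DATALENGTH does not include)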