Total records retrieved by a SOSL query - Salesforce

It's established that the total number of records retrieved by a single SOSL query is 2,000... but I have tried a SOSL query in my org that matches 10,000 records, and it ultimately returns only 250. Isn't that supposed to be 2,000?

Related

What does the rows_produced count represent in the Snowflake query_history view when the query is a MERGE from a file

I am executing a MERGE query to perform a CDC operation. I have a target table holding around 50 million records, and the incoming file that is the source for the MERGE contains 230 records. There is a simple join on the ID column of the table and the id column from the file data. After execution, the History view shows 200 records inserted and 30 records updated. However, it shows rows_produced as 5K. I need to understand what rows_produced means in this case. Does it show the rows returned as part of the join? If so, it should match the row count of the file.
I believe that rows_produced is the total number of records that were created when the underlying micropartitions were written out.
For example, if you updated 1 record, you are actually recreating the entire micropartition of data that this 1 record exists in (micropartitions are immutable, so therefore never updated). If that 1 record exists in a micropartition that contains 100 records, then you'd get an output that has 1 record updated, but 100 rows_produced.
This information is "interesting" but not helpful when trying to verify that your MERGE statement produced the right outcome. Using the insert, update, and delete output of the MERGE is the accurate way to check that.
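For example (a minimal sketch with made-up table names, not the poster's actual statement), the MERGE result itself reports the numbers to reconcile against the 230-row file:

-- Hypothetical names; the shape mirrors the CDC pattern described above.
MERGE INTO target_table t
USING incoming_file_staging s        -- file contents loaded into a staging table
    ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.payload = s.payload
WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);
-- The statement's own result set reports "number of rows inserted" and
-- "number of rows updated" (200 and 30 in the case above); those are the
-- figures that should match the incoming file, not rows_produced.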

Postgres check if any new rows were inserted

I have numerous quite large tables (300-400 tables, ~30 million rows each). Every day (once a day) I have to check whether any new rows were inserted into any of these tables. The number of rows inserted may vary from 0 to 30 million. Rows are not going to be deleted.
At the moment, I check whether any new rows were inserted using an approximate count, and then compare it with the previous (yesterday's) result.
SELECT reltuples FROM pg_class WHERE oid='tablename'::regclass;
My main doubt: how soon will reltuples be updated if, for example, 3,000 rows are inserted (or just 5 rows)? And is an approximate count a good solution for this case?
My config parameters are:
autovacuum_analyze_threshold: 50
autovacuum_analyze_scale_factor: 0.1
reltuples will be updated whenever VACUUM (or autovacuum) runs, so this number normally has an error margin of up to 20%.
You'll get a better estimate for the number of rows in the table from the table statistics view:
SELECT n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'myschema' AND relname = 'mytable';
This number is updated by the statistics collector, so it is not guaranteed to be 100% accurate (there is a UDP socket involved), and it may take a little while for the effects of a data modification to be visible there.
Still it is often a more accurate estimate than reltuples.
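As a rough sketch (the snapshot table here is hypothetical, not something built into Postgres), you could record the estimates once a day and diff against the previous day's snapshot:

-- Hypothetical helper table for daily snapshots of the estimates.
CREATE TABLE IF NOT EXISTS row_count_snapshot (
    relname  name,
    live_tup bigint,
    taken_on date DEFAULT current_date
);

-- Run once a day: record today's estimates.
INSERT INTO row_count_snapshot (relname, live_tup)
SELECT relname, n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'myschema';

-- Tables whose estimate grew since yesterday's snapshot.
SELECT t.relname, t.n_live_tup - s.live_tup AS estimated_new_rows
FROM pg_stat_user_tables t
JOIN row_count_snapshot s
  ON s.relname = t.relname AND s.taken_on = current_date - 1
WHERE t.schemaname = 'myschema'
  AND t.n_live_tup > s.live_tup;

Since n_live_tup is only an estimate, very small insert batches may not show up immediately, as noted above.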

BigQuery API - Quota exceeded: Your table exceeded quota for UPDATE or DELETE queries per table

As described in this question, there's a quota limit on how many updates you can run per day.
I've looked up the docs and this part caught my attention:
Daily update limit: 1,000 updates per table per day; applies only to the destination table in a query.
The problem is that I didn't run 1,000 updates on my table (maybe 80, 150 max).
So I'd like to know whether there's a solution for this or whether the docs are out of date.
Thanks.
EDIT
This only happens if I use the BigQuery API; I'm able to update tables through the console.
Looks like you are using DML
Per documentation:
DML statements are significantly more expensive to process than SELECT statements.
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
You can see more details in this documentation
The quota you referenced in your question relates to the quotas for queries - see more here.
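A common way to stay within the per-table DML quota is to accumulate row-level changes in a staging table and apply them with a single MERGE, which counts as one statement against the target table. A rough sketch with made-up project, dataset and column names:

-- Hypothetical names; one MERGE replaces many individual UPDATEs.
MERGE `myproject.mydataset.target` t
USING `myproject.mydataset.staged_changes` s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value)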

SQLite insert speed slows as number of records increases due to an index

Original question
Background
It is well known that SQLite needs to be fine-tuned to achieve insert speeds on the order of 50k inserts/s. There are many questions here regarding slow insert speeds and a wealth of advice and benchmarks.
There are also claims that SQLite can handle large amounts of data, with reports of 50+ GB not causing any problems with the right settings.
I have followed the advice here and elsewhere to achieve these speeds and I'm happy with 35k-45k inserts/s. The problem I have is that all of the benchmarks only demonstrate fast insert speeds with < 1m records. What I am seeing is that insert speed seems to be inversely proportional to table size.
Issue
My use case requires storing 500m to 1b tuples ([x_id, y_id, z_id]) over a few years (1m rows / day) in a link table. The values are all integer IDs between 1 and 2,000,000. There is a single index on z_id.
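For reference, a minimal sketch of the link table just described (the table and index names here are assumed):

CREATE TABLE link (
    x_id INTEGER NOT NULL,
    y_id INTEGER NOT NULL,
    z_id INTEGER NOT NULL
);
CREATE INDEX link_z_idx ON link (z_id);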
Performance is great for the first 10m rows, ~35k inserts/s, but by the time the table has ~20m rows, performance starts to suffer. I'm now seeing about 100 inserts/s.
The size of the table is not particularly large. With 20m rows, the size on disk is around 500MB.
The project is written in Perl.
Question
Is this the reality of large tables in SQLite or are there any secrets to maintaining high insert rates for tables with > 10m rows?
Known workarounds which I'd like to avoid if possible
Drop the index, add the records, and re-index: This is fine as a workaround, but doesn't work when the DB still needs to be usable during updates. It won't work to make the database completely inaccessible for x minutes / day
Break the table into smaller subtables / files: This will work in the short term and I have already experimented with it. The problem is that I need to be able to retrieve data from the entire history when querying which means that eventually I'll hit the 62 table attachment limit. Attaching, collecting results in a temp table, and detaching hundreds of times per request seems to be a lot of work and overhead, but I'll try it if there are no other alternatives.
Set SQLITE_FCNTL_CHUNK_SIZE: I don't know C (?!), so I'd prefer to not learn it just to get this done. I can't see any way to set this parameter using Perl though.
UPDATE
Following Tim's suggestion that an index was causing increasingly
slow insert times despite SQLite's claims that it is capable
of handling large data sets, I performed a benchmark comparison with the following
settings:
inserted rows: 14 million
commit batch size: 50,000 records
cache_size pragma: 10,000
page_size pragma: 4,096
temp_store pragma: memory
journal_mode pragma: delete
synchronous pragma: off
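Expressed as SQL, those settings correspond to pragmas along these lines (a sketch; they are issued on the connection before the bulk load):

PRAGMA cache_size   = 10000;
PRAGMA page_size    = 4096;   -- only takes effect before the database file is created (or after VACUUM)
PRAGMA temp_store   = MEMORY;
PRAGMA journal_mode = DELETE;
PRAGMA synchronous  = OFF;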
In my project, as in the benchmark results below, a file-based temporary table is created and SQLite's built-in support
for importing CSV data is used. The temporary table is then attached
to the receiving database and sets of 50,000 rows are inserted with an
insert-select statement. Therefore, the insert times do not reflect
file to database insert times, but rather table to table insert
speed. Taking the CSV import time into account would reduce the speeds
by 25-50% (a very rough estimate, it doesn't take long to import the
CSV data).
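The attach-and-copy step looks roughly like this (the file and table names here are placeholders, not the project's actual names):

ATTACH DATABASE 'csv_import.db' AS staging;

-- one 50,000-row slice per insert-select
INSERT INTO link (x_id, y_id, z_id)
SELECT x_id, y_id, z_id
FROM staging.import_batch
WHERE rowid >  0
  AND rowid <= 50000;

DETACH DATABASE staging;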
Clearly having an index causes the slowdown in insert speed as table size increases.
It's quite clear from the data above that credit goes to Tim's answer rather than to the assertions that SQLite just can't handle it. Clearly it can handle large datasets if indexing that dataset is not part of your use case. I have been using SQLite for just that for a while now, as the backend for a logging system that does not need to be indexed, so I was quite surprised at the slowdown I experienced.
Conclusion
If anyone finds themselves wanting to store a large amount of data using SQLite and have it indexed, using shards may be the answer. I eventually settled on using the first three characters of an MD5 hash of a unique column in z to determine assignment to one of 4,096 databases (three hex characters give 16³ = 4,096 possible values, one per database). Since my use case is primarily archival in nature, the schema will not change and queries will never require shard walking. Database size is bounded since extremely old data will be reduced and eventually discarded, so this combination of sharding, pragma settings, and even some denormalisation gives me a nice balance that will, based on the benchmarking above, maintain an insert speed of at least 10k inserts / second.
If your requirement is to find a particular z_id and the x_ids and y_ids linked to it (as distinct from quickly selecting a range of z_ids) you could look into a non-indexed hash-table nested-relational db that would allow you to instantly find your way to a particular z_id in order to get its y_ids and x_ids -- without the indexing overhead and the concomitant degraded performance during inserts as the index grows. In order to avoid clumping (aka bucket collisions), choose a key hashing algorithm that puts greatest weight on the digits of z_id with greatest variation (right-weighted).
P.S. A database that uses a b-tree may at first appear faster than a db that uses linear hashing, say, but the insert performance will remain level with the linear hash as performance on the b-tree begins to degrade.
P.P.S. To answer #kawing-chiu's question: the core feature relevant here is that such a database relies on so-called "sparse" tables in which the physical location of a record is determined by a hashing algorithm which takes the record key as input. This approach permits a seek directly to the record's location in the table without the intermediary of an index. As there is no need to traverse indexes or to re-balance indexes, insert-times remain constant as the table becomes more densely populated. With a b-tree, by contrast, insert times degrade as the index tree grows. OLTP applications with large numbers of concurrent inserts can benefit from such a sparse-table approach. The records are scattered throughout the table. The downside of records being scattered across the "tundra" of the sparse table is that gathering large sets of records which have a value in common, such as a postal code, can be slower. The hashed sparse-table approach is optimized to insert and retrieve individual records, and to retrieve networks of related records, not large sets of records that have some field value in common.
A nested relational database is one that permits tuples within a column of a row.
Great question and very interesting follow-up!
I would just like to make a quick remark: you mentioned that breaking the table into smaller subtables / files and attaching them later is not an option because you'll quickly reach the hard limit of 62 attached databases. While this is completely true, I don't think you have considered a mid-way option: sharding the data into several tables but keeping the same, single database (file).
I did a very crude benchmark just to make sure my suggestion really has an impact on performance.
Schema:
CREATE TABLE IF NOT EXISTS "test_$i"
(
    "i"   integer NOT NULL,
    "md5" text(32) NOT NULL
);
Data - 2 Million Rows:
i = 1..2,000,000
md5 = md5 hex digest of i
Each transaction = 50,000 INSERTs.
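For concreteness, one generated row is inserted like this (a sketch; the benchmark script itself is not shown here):

INSERT INTO "test_1" ("i", "md5") VALUES (1, 'c4ca4238a0b923820dcc509a6f75849b');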
Databases: 1; Tables: 1; Indexes: 0
0..50000 records inserted in 1.87 seconds
50000..100000 records inserted in 1.92 seconds
100000..150000 records inserted in 1.97 seconds
150000..200000 records inserted in 1.99 seconds
200000..250000 records inserted in 2.19 seconds
250000..300000 records inserted in 1.94 seconds
300000..350000 records inserted in 1.94 seconds
350000..400000 records inserted in 1.94 seconds
400000..450000 records inserted in 1.94 seconds
450000..500000 records inserted in 2.50 seconds
500000..550000 records inserted in 1.94 seconds
550000..600000 records inserted in 1.94 seconds
600000..650000 records inserted in 1.93 seconds
650000..700000 records inserted in 1.94 seconds
700000..750000 records inserted in 1.94 seconds
750000..800000 records inserted in 1.94 seconds
800000..850000 records inserted in 1.93 seconds
850000..900000 records inserted in 1.95 seconds
900000..950000 records inserted in 1.94 seconds
950000..1000000 records inserted in 1.94 seconds
1000000..1050000 records inserted in 1.95 seconds
1050000..1100000 records inserted in 1.95 seconds
1100000..1150000 records inserted in 1.95 seconds
1150000..1200000 records inserted in 1.95 seconds
1200000..1250000 records inserted in 1.96 seconds
1250000..1300000 records inserted in 1.98 seconds
1300000..1350000 records inserted in 1.95 seconds
1350000..1400000 records inserted in 1.95 seconds
1400000..1450000 records inserted in 1.95 seconds
1450000..1500000 records inserted in 1.95 seconds
1500000..1550000 records inserted in 1.95 seconds
1550000..1600000 records inserted in 1.95 seconds
1600000..1650000 records inserted in 1.95 seconds
1650000..1700000 records inserted in 1.96 seconds
1700000..1750000 records inserted in 1.95 seconds
1750000..1800000 records inserted in 1.95 seconds
1800000..1850000 records inserted in 1.94 seconds
1850000..1900000 records inserted in 1.95 seconds
1900000..1950000 records inserted in 1.95 seconds
1950000..2000000 records inserted in 1.95 seconds
Database file size: 89.2 MiB.
Databases: 1; Tables: 1; Indexes: 1 (md5)
0..50000 records inserted in 2.90 seconds
50000..100000 records inserted in 11.64 seconds
100000..150000 records inserted in 10.85 seconds
150000..200000 records inserted in 10.62 seconds
200000..250000 records inserted in 11.28 seconds
250000..300000 records inserted in 12.09 seconds
300000..350000 records inserted in 10.60 seconds
350000..400000 records inserted in 12.25 seconds
400000..450000 records inserted in 13.83 seconds
450000..500000 records inserted in 14.48 seconds
500000..550000 records inserted in 11.08 seconds
550000..600000 records inserted in 10.72 seconds
600000..650000 records inserted in 14.99 seconds
650000..700000 records inserted in 10.85 seconds
700000..750000 records inserted in 11.25 seconds
750000..800000 records inserted in 17.68 seconds
800000..850000 records inserted in 14.44 seconds
850000..900000 records inserted in 19.46 seconds
900000..950000 records inserted in 16.41 seconds
950000..1000000 records inserted in 22.41 seconds
1000000..1050000 records inserted in 24.68 seconds
1050000..1100000 records inserted in 28.12 seconds
1100000..1150000 records inserted in 26.85 seconds
1150000..1200000 records inserted in 28.57 seconds
1200000..1250000 records inserted in 29.17 seconds
1250000..1300000 records inserted in 36.99 seconds
1300000..1350000 records inserted in 30.66 seconds
1350000..1400000 records inserted in 32.06 seconds
1400000..1450000 records inserted in 33.14 seconds
1450000..1500000 records inserted in 47.74 seconds
1500000..1550000 records inserted in 34.51 seconds
1550000..1600000 records inserted in 39.16 seconds
1600000..1650000 records inserted in 37.69 seconds
1650000..1700000 records inserted in 37.82 seconds
1700000..1750000 records inserted in 41.43 seconds
1750000..1800000 records inserted in 49.58 seconds
1800000..1850000 records inserted in 44.08 seconds
1850000..1900000 records inserted in 57.17 seconds
1900000..1950000 records inserted in 50.04 seconds
1950000..2000000 records inserted in 42.15 seconds
Database file size: 181.1 MiB.
Databases: 1; Tables: 20 (one per 100,000 records); Indexes: 1 (md5)
0..50000 records inserted in 2.91 seconds
50000..100000 records inserted in 10.30 seconds
100000..150000 records inserted in 10.85 seconds
150000..200000 records inserted in 10.45 seconds
200000..250000 records inserted in 10.11 seconds
250000..300000 records inserted in 11.04 seconds
300000..350000 records inserted in 10.25 seconds
350000..400000 records inserted in 10.36 seconds
400000..450000 records inserted in 11.48 seconds
450000..500000 records inserted in 10.97 seconds
500000..550000 records inserted in 10.86 seconds
550000..600000 records inserted in 10.35 seconds
600000..650000 records inserted in 10.77 seconds
650000..700000 records inserted in 10.62 seconds
700000..750000 records inserted in 10.57 seconds
750000..800000 records inserted in 11.13 seconds
800000..850000 records inserted in 10.44 seconds
850000..900000 records inserted in 10.40 seconds
900000..950000 records inserted in 10.70 seconds
950000..1000000 records inserted in 10.53 seconds
1000000..1050000 records inserted in 10.98 seconds
1050000..1100000 records inserted in 11.56 seconds
1100000..1150000 records inserted in 10.66 seconds
1150000..1200000 records inserted in 10.38 seconds
1200000..1250000 records inserted in 10.24 seconds
1250000..1300000 records inserted in 10.80 seconds
1300000..1350000 records inserted in 10.85 seconds
1350000..1400000 records inserted in 10.46 seconds
1400000..1450000 records inserted in 10.25 seconds
1450000..1500000 records inserted in 10.98 seconds
1500000..1550000 records inserted in 10.15 seconds
1550000..1600000 records inserted in 11.81 seconds
1600000..1650000 records inserted in 10.80 seconds
1650000..1700000 records inserted in 11.06 seconds
1700000..1750000 records inserted in 10.24 seconds
1750000..1800000 records inserted in 10.57 seconds
1800000..1850000 records inserted in 11.54 seconds
1850000..1900000 records inserted in 10.80 seconds
1900000..1950000 records inserted in 11.07 seconds
1950000..2000000 records inserted in 13.27 seconds
Database file size: 180.1 MiB.
As you can see, the insert speed remains pretty much constant if you shard the data into several tables.
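Reading across the per-table shards can then be done with a UNION ALL over the test_$i tables (a sketch; in practice the statement would be generated programmatically, and :needle is a bound parameter):

SELECT i, md5 FROM test_1 WHERE md5 = :needle
UNION ALL
SELECT i, md5 FROM test_2 WHERE md5 = :needle
UNION ALL
SELECT i, md5 FROM test_3 WHERE md5 = :needle;
-- ...extended (or generated) to cover the remaining shards.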
Unfortunately I'd say this is a limitation of large tables in SQLite. It's not designed to operate on large-scale or large-volume datasets. While I understand it may drastically increase project complexity, you're probably better off researching more sophisticated database solutions appropriate to your needs.
From everything you linked, it looks like table size to access speed is a direct tradeoff. Can't have both.
I suspect that hash-value collisions in the index cause the insert slowdown.
When there are very many rows in one table, collisions on the indexed column's hash value happen more frequently. That means the SQLite engine needs to calculate the hash value two, three, or maybe even four times in order to get a distinct value.
So I guess this is the root cause of the SQLite insert slowness when the table has many rows.
This would also explain why using shards avoids the problem.
Is there a real SQLite expert here who can confirm or deny this point?
In my project, I couldn't shard the database, as it's indexed on different columns. To speed up the inserts, I put the database on /dev/shm (a Linux ramdisk) during creation and then copy it over to local disk. That obviously only works well for a write-once, read-many database.

Unexpected estimated rows in query execution plan (SQL Server 2000)

if I run this query
select user from largetable where largetable.user = 1155
(note I'm querying user just to reduce this to its simplest case)
And look at the execution plan, an index seek is planned [largetable has an index on user], and the estimated rows is the correct 29.
But if I do
select user from largetable where largetable.user = (select user from users where externalid = 100)
[with the result of the subquery being the single value 1155, just like above when I hard-code it]
The query optimizer estimates 117,000 rows in the result. There are about 6,000,000 rows in largetable and 1,700 rows in users. When I run the query, of course, I get back the correct 29 rows despite the huge estimate.
I have updated stats with fullscan on both tables on the relevant indexes, and when I look at the stats, they appear to be correct.
Of note, for any given user, there are no more than 3,000 rows in largetable.
So, why would the estimated execution plan show such a large number of estimated rows? Shouldn't the optimizer know, based on the stats, that it's looking for a result with 29 corresponding rows, or a maximum of 3,000 rows, even if it doesn't know which user will be selected by the subquery? Why this huge estimate? The problem is that this large estimate then influences another join in a larger query to do a scan instead of a seek. If I run the larger query with the subquery, it takes 1 min 40 secs; if I run it with 1155 hard-coded, it takes 2 seconds. This is very unusual to me...
Thanks,
Chris
The optimizer does the best it can, but statistics and row count estimations only go so far (as you're seeing).
I'm assuming that your more complex query can't easily be rewritten as a join without a subquery. If it can be, you should attempt that first.
Failing that, it's time for you to use your additional knowledge about the nature of your data to help out the optimizer with hints. Specifically look at the forceseek option in the index hints. Note that this can be bad if your data changes later, so be aware.
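For illustration, the hint would look something like this (note that FORCESEEK only exists in SQL Server 2008 and later, so on SQL Server 2000 an INDEX(...) hint would be the closest equivalent):

SELECT lt.user
FROM largetable lt WITH (FORCESEEK)
WHERE lt.user = (SELECT u.user FROM users u WHERE u.externalid = 100)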
Did you try this?
SELECT lt.user
FROM Users u
INNER JOIN largeTable lt
ON u.User = lt.User
WHERE u.externalId = 100
Please see this: subqueries-vs-joins
