How can I tell how many objects I've stored in an S3 bucket?

Unless I'm missing something, it seems that none of the APIs I've looked at will tell you how many objects are in an <S3 bucket>/<folder>. Is there any way to get a count?

Using AWS CLI
aws s3 ls s3://mybucket/ --recursive | wc -l
or
aws cloudwatch get-metric-statistics \
--namespace AWS/S3 --metric-name NumberOfObjects \
--dimensions Name=BucketName,Value=BUCKETNAME \
Name=StorageType,Value=AllStorageTypes \
--start-time 2016-11-05T00:00 --end-time 2016-11-05T00:10 \
--period 60 --statistics Average
Note: The above cloudwatch command seems to work for some while not for others. Discussed here: https://forums.aws.amazon.com/thread.jspa?threadID=217050
Using AWS Web Console
You can look at CloudWatch's metrics section to get the approximate number of objects stored.
I have approximately 50 million objects, and counting them with aws s3 ls took more than an hour.

There is a --summarize switch that shows bucket summary information (i.e. number of objects, total size).
Here's the correct answer using the AWS CLI:
aws s3 ls s3://bucketName/path/ --recursive --summarize | grep "Total Objects:"
Total Objects: 194273
See the documentation

Although this is an old question, and feedback was provided in 2015, it's much simpler now: the S3 web console has a "Get Size" option, which reports the total number of objects and their total size for the selected items.

There is an easy solution with the S3 API now (available in the AWS cli):
aws s3api list-objects --bucket BUCKETNAME --output json --query "[length(Contents[])]"
or for a specific folder:
aws s3api list-objects --bucket BUCKETNAME --prefix "folder/subfolder/" --output json --query "[length(Contents[])]"

If you use the s3cmd command-line tool, you can get a recursive listing of a particular bucket, outputting it to a text file.
s3cmd ls -r s3://logs.mybucket/subfolder/ > listing.txt
Then on Linux you can run wc -l on the file to count the lines (one line per object).
wc -l listing.txt

There is no way, unless you:
- list them all in batches of 1000 (which can be slow and eats bandwidth; Amazon never seems to compress the XML responses) - a sketch of this batched listing follows at the end of this answer, or
- log into your account on S3, and go to Account - Usage. It seems the billing dept knows exactly how many objects you have stored!
Simply downloading the list of all your objects will actually take some time and cost some money if you have 50 million objects stored.
Also see this thread about StorageObjectCount - which is in the usage data.
An S3 API to get at least the basics, even if it was hours old, would be great.
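For what it's worth, here's a minimal sketch of that batched listing using boto3 (this is just one way to do it; it assumes credentials are already configured and takes the bucket name as the first argument):
import sys

import boto3

# Each ListObjectsV2 call returns at most 1000 keys; the paginator follows
# the continuation tokens until the listing is exhausted.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
for page in paginator.paginate(Bucket=sys.argv[1]):
    count += page.get("KeyCount", 0)

print(count)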

You can use AWS CloudWatch metrics for S3 to see the exact count for each bucket.

2020/10/22
With AWS Console
Look at the Metrics tab on your bucket
or:
Look at AWS CloudWatch's metrics
With AWS CLI
Number of objects:
aws s3api list-objects --bucket <BUCKET_NAME> --prefix "<FOLDER_NAME>" | grep "Key" | wc -l
or:
aws s3 ls s3://<BUCKET_NAME>/<FOLDER_NAME>/ --recursive --summarize --human-readable | grep "Total Objects"
or with s4cmd:
s4cmd ls -r s3://<BUCKET_NAME>/<FOLDER_NAME>/ | wc -l
Objects size:
aws s3api list-objects --bucket <BUCKET_NAME> --output json --query "[sum(Contents[].Size), length(Contents[])]" | awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
or:
aws s3 ls s3://<BUCKET_NAME>/<FOLDER_NAME>/ --recursive --summarize --human-readable | grep "Total Size"
or with s4cmd:
s4cmd du s3://<BUCKET_NAME>
or with CloudWatch metrics:
aws cloudwatch get-metric-statistics --metric-name BucketSizeBytes --namespace AWS/S3 --start-time 2020-10-20T16:00:00Z --end-time 2020-10-22T17:00:00Z --period 3600 --statistics Average --unit Bytes --dimensions Name=BucketName,Value=<BUCKET_NAME> Name=StorageType,Value=StandardStorage --output json | grep "Average"

2021 Answer
This information is now surfaced in the AWS dashboard. Simply navigate to the bucket and click the Metrics tab.

If you are using the AWS CLI on Windows, you can use Measure-Object from PowerShell to get the total count of files, just like wc -l on *nix.
PS C:\> aws s3 ls s3://mybucket/ --recursive | Measure-Object
Count : 25
Average :
Sum :
Maximum :
Minimum :
Property :
Hope it helps.

Go to AWS Billing, then reports, then AWS Usage reports.
Select Amazon Simple Storage Service, then Operation StandardStorage.
Then you can download a CSV file that includes a UsageType of StorageObjectCount that lists the item count for each bucket.
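If you'd rather script against that CSV, here's a rough sketch that totals the object counts per bucket. The file name and the column names (UsageType, Resource, UsageValue) are assumptions; check the header row of the report you actually download:
import csv
from collections import defaultdict

# Sketch only: adjust the file name and column names to match your report.
counts = defaultdict(float)
with open("usage_report.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row.get("UsageType", "").endswith("StorageObjectCount"):
            counts[row["Resource"]] += float(row["UsageValue"])

for bucket, value in sorted(counts.items()):
    print(bucket, int(value))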

From the command line with the AWS CLI, use ls plus --summarize. It will give you the list of all of your items and the total number of documents in a particular bucket. I have not tried this with buckets containing sub-folders:
aws s3 ls "s3://MyBucket" --summarize
It may take a while (listing my 16K+ documents took about 4 minutes), but it's faster than counting 1K at a time.

You can easily get the total count and the history by going to the S3 console's "Management" tab and then clicking "Metrics".

One of the simplest ways to count the number of objects in S3 is:
Step 1: Select the root folder.
Step 2: Click Actions -> Delete (obviously, be careful: don't actually delete it).
Step 3: Wait a few minutes and AWS will show you the number of objects and their total size.

As of November 18, 2020 there is now an easier way to get this information without taxing your API requests:
AWS S3 Storage Lens
The default, built-in, free dashboard allows you to see the count for all buckets, or individual buckets under the "Buckets" tab. There are many drop downs to filter and sort almost any reasonable metric you would look for.

In s3cmd, simply run the following command (on an Ubuntu system):
s3cmd ls -r s3://mybucket | wc -l

None of the APIs will give you a count, because there really isn't an Amazon-specific API to do that. You just have to run a list operation and count the number of results that are returned.

You can just execute this CLI command to get the total file count in the bucket or in a specific folder.
Scan whole bucket
aws s3api list-objects-v2 --bucket testbucket | grep "Key" | wc -l
aws s3api list-objects-v2 --bucket BUCKET_NAME | grep "Key" | wc -l
You can use this command to get the full details:
aws s3api list-objects-v2 --bucket BUCKET_NAME
Scan a specific folder
aws s3api list-objects-v2 --bucket testbucket --prefix testfolder --start-after testfolder/ | grep "Key" | wc -l
aws s3api list-objects-v2 --bucket BUCKET_NAME --prefix FOLDER_NAME --start-after FOLDER_NAME/ | grep "Key" | wc -l

Select the bucket/folder -> click Actions -> click Calculate Total Size.

The API will return the list in increments of 1000. Check the IsTruncated property to see if there are still more. If there are, you need to make another call and pass the last key you got as the Marker property on the next call. You then continue to loop like this until IsTruncated is false.
See this Amazon doc for more info: Iterating Through Multi-Page Results
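As a rough illustration of that loop, here is a sketch using the original ListObjects API via boto3 (the bucket name is a placeholder):
import boto3

s3 = boto3.client("s3")

count = 0
kwargs = {"Bucket": "my-bucket"}
while True:
    resp = s3.list_objects(**kwargs)
    contents = resp.get("Contents", [])
    count += len(contents)
    if not resp.get("IsTruncated"):
        break
    # Pass the last key returned as the Marker for the next call.
    kwargs["Marker"] = contents[-1]["Key"]

print(count)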

Old thread, but still relevant as I was looking for the answer until I just figured this out. I wanted a file count using a GUI-based tool (i.e. no code). I happen to already use a tool called 3Hub for drag & drop transfers to and from S3. I wanted to know how many files I had in a particular bucket (I don't think billing breaks it down by buckets).
So, using 3Hub,
- list the contents of the bucket (looks basically like a finder or explorer window)
- go to the bottom of the list, click 'show all'
- select all (ctrl+a)
- choose copy URLs from right-click menu
- paste the list into a text file (I use TextWrangler for Mac)
- look at the line count
I had 20521 files in the bucket and did the file count in less than a minute.

I used the python script from scalablelogic.com (adding in the count logging). Worked great.
#!/usr/local/bin/python

import sys
from boto.s3.connection import S3Connection

s3bucket = S3Connection().get_bucket(sys.argv[1])
size = 0
totalCount = 0

for key in s3bucket.list():
    totalCount += 1
    size += key.size

print 'total size:'
print "%.3f GB" % (size*1.0/1024/1024/1024)
print 'total count:'
print totalCount

The issue @Mayank Jaiswal mentioned about using CloudWatch metrics should not actually be an issue. If you aren't getting results, your range might just not be wide enough. It's currently Nov 3, and I wasn't getting results no matter what I tried. I went to the S3 bucket and looked at the counts, and the last record for the "Total number of objects" count was Nov 1.
So here is what the CloudWatch solution looks like using the JavaScript aws-sdk:
import aws from 'aws-sdk';
import { startOfMonth } from 'date-fns';

const region = 'us-east-1';
const profile = 'default';
const credentials = new aws.SharedIniFileCredentials({ profile });
aws.config.update({ region, credentials });

export const main = async () => {
  const cw = new aws.CloudWatch();
  const bucket_name = 'MY_BUCKET_NAME';
  const end = new Date();
  const start = startOfMonth(end);

  const results = await cw
    .getMetricStatistics({
      // @ts-ignore
      Namespace: 'AWS/S3',
      MetricName: 'NumberOfObjects',
      Period: 3600 * 24,
      StartTime: start.toISOString(),
      EndTime: end.toISOString(),
      Statistics: ['Average'],
      Dimensions: [
        { Name: 'BucketName', Value: bucket_name },
        { Name: 'StorageType', Value: 'AllStorageTypes' },
      ],
      Unit: 'Count',
    })
    .promise();

  console.log({ results });
};

main()
  .then(() => console.log('Done.'))
  .catch((err) => console.error(err));
Notice two things:
- The start of the range is set to the beginning of the month.
- The period is set to a day. Any less and you might get an error saying that you have requested too many data points.
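If you'd rather stay in Python, roughly the same CloudWatch query can be made with boto3. This is a sketch with a placeholder bucket name and region; the same caveats about the range and period apply:
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

# Widen the range so at least one daily data point exists.
end = datetime.utcnow()
start = end - timedelta(days=30)

resp = cw.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="NumberOfObjects",
    Dimensions=[
        {"Name": "BucketName", "Value": "MY_BUCKET_NAME"},
        {"Name": "StorageType", "Value": "AllStorageTypes"},
    ],
    StartTime=start,
    EndTime=end,
    Period=3600 * 24,  # one day, to avoid the "too many data points" error
    Statistics=["Average"],
    Unit="Count",
)
print(resp["Datapoints"])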

aws s3 ls s3://bucket-name/folder-prefix-if-any --recursive | wc -l

Here's the boto3 version of the python script embedded above.
import sys
import boto3

s3 = boto3.resource("s3")
s3bucket = s3.Bucket(sys.argv[1])
size = 0
totalCount = 0

for key in s3bucket.objects.all():
    totalCount += 1
    size += key.size

print("total size:")
print("%.3f GB" % (size * 1.0 / 1024 / 1024 / 1024))
print("total count:")
print(totalCount)

3Hub is discontinued. There's a better solution: you can use Transmit (Mac only); just connect to your bucket and choose Show Item Count from the View menu.

You can download and install S3 Browser from http://s3browser.com/. When you select a bucket, you can see the number of files in the bucket in the center-right corner. But the size it shows is incorrect in the current version.

You can potentially use Amazon S3 Inventory, which will give you a list of objects in a CSV file.
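Once an inventory report has been delivered, counting its rows gives the object count, since each row is one object. A rough sketch, assuming you've downloaded the gzipped CSV data files into a local inventory/ directory (the path is an assumption; the inventory CSV data files carry no header row):
import glob
import gzip

# Sum the rows across the downloaded inventory CSV data files.
total = 0
for path in glob.glob("inventory/*.csv.gz"):
    with gzip.open(path, "rt") as f:
        total += sum(1 for _ in f)

print(total)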

This can also be done with gsutil du (yes, a Google Cloud tool):
gsutil du s3://mybucket/ | wc -l

If you're looking for specific files, let's say .jpg images, you can do the following:
aws s3 ls s3://your_bucket | grep jpg | wc -l
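If you want to match on the actual key suffix rather than any line containing "jpg", a small boto3 sketch like this can filter precisely (the bucket name is a placeholder):
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Count only keys that actually end in ".jpg".
count = 0
for page in paginator.paginate(Bucket="your_bucket"):
    count += sum(1 for obj in page.get("Contents", []) if obj["Key"].endswith(".jpg"))

print(count)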

Related

How can I use the WooCommerce CLI to update a customer's metadata?

I feel like I should be able to do
sudo wp wc customer update 141 --meta_data="[{"key":"test_meta1","value":"test_meta_value"}]" --user=1
However, this does not work.
The result I get is
Success: Updated customer 141.
However,
sudo wp wc customer get 141 --fields=meta_data --user=1 | grep test_meta
returns blank.
Almost certainly, I'm not providing the data in the right format, but I'm not sure what it is, or where it's documented.
Possibly helpful links:
How to add meta_data to the a WooCommerce customer
https://github.com/woocommerce/woocommerce/issues/18810
https://github.com/woocommerce/woocommerce/wiki/WC-CLI-Overview
The syntax is
--meta_data='[{"key": "test_meta1","value": "test_meta_value"}]'
e.g.
sudo wp wc customer update 141 --meta_data='[{"key": "test_meta1","value": "test_meta_value"}]' --user=1
or
sudo wp wc product create --name="test4" --regular_price=10 --meta_data='[{"key": "test_meta1","value": "test_meta_value"}]' --user=1
will work

GCP App Engine app does not correctly connect to Cloud SQL database

I have a connection issue between my Spring Boot/Kotlin app and the database. I tried following the guides Google has to offer and some Stack Overflow posts. I added client/admin SQL permissions to the users.
After following the Google guide https://github.com/spring-cloud/spring-cloud-gcp/tree/master/spring-cloud-gcp-samples/spring-cloud-gcp-sql-postgres-sample and recreating everything as it is, I came to a problem: the deployed app does not connect to the actual database; instead it uses the data.sql and schema.sql objects from resources. It also does not make changes to the actual database, but somewhere else. For example, I am using this command to get data:
@GetMapping("/getdata")
fun getTuples(): List<String>? {
    return jdbcTemplate!!.queryForList("SELECT * FROM users").stream()
        .map { m: Map<String?, Any?> -> m.values.toString() }
        .collect(Collectors.toList())
}
The result I get is the one provided in the example:
0 "[luisao@example.com, Anderson, Silva]"
1 "[jonas@example.com, Jonas, Goncalves]"
2 "[fejsa@example.com, Ljubomir, Fejsa]"
The actual users table in the Cloud SQL database looks like this:
postgres=> select * from users;
email | first_name | last_name
--------------------+------------+-----------
luisao@example.com | Anderson | Silva
jonas@example.com | Jonas | Goncalves
fejsa@example.com | Ljubomir | Fejsa
kkskldk | sjndjfdf | skdskd
(4 rows)
The first 3 rows are the same only because I added them manually. The app does not see DDL/DML changes in the database, only those I make in data.sql or schema.sql. So the question is: where could the problem be?
The documentation at https://cloud.spring.io/spring-cloud-gcp/multi/multi__spring_jdbc.html says that this dependency should do most of the work automatically, but somehow something goes wrong.
Could it be that Spring Boot/Hikari creates a virtual environment in App Engine which blocks the database connection, so you connect to and see only the mock data from the data.sql files? Or does the problem lie in the configuration?
My application.properties file:
server.error.include-message=always
spring.cloud.gcp.sql.database-name=[database name]
spring.cloud.gcp.sql.instance-connection-name=[Instance name]
spring.datasource.continue-on-error=true
spring.datasource.initialization-mode=always
spring.datasource.username=postgres
spring.datasource.password=[password]
spring.cloud.gcp.project-id=[project id]

Creating a batch file (or something else maybe?) that scans a .txt file and then copies the specified text to another .txt file

I own a small Minecraft server, and I would like to create a Google spreadsheet for calculating user playtime data. I want this data because it would help let me know whether my advertising campaigns are working or not. You can try to eyeball this stuff, but a solid data set would be a lot more effective than guessing whether the advertising is effective. The problem lies in the fact that manually searching for data in the server logs is really hard. I would appreciate anyone who could help me build a simple script or something that reads a .txt file and extracts the data I need. The script needs to be able to:
Detect lines with "User Authenticator" and "Disconnected" then print the entire line.
Format the text in some way? Possibly alphabetize the lines so that we're not all over the place looking for specific users' logins and logouts, defeating the purpose of the script. Not sure if this is possible.
Exclude lines with certain text (usernames), we want normal player data, not admin data.
I am sorry if I did anything wrong; this is my first time on the site.
UPDATE: The admin data would be stored in a file called "admins.txt". By "alphabetizing" I meant, for example: Player A joins at 06:00, Player B joins at 06:30, then Player A leaves at 06:45, Player B leaves at 07:00. If the data was flat, it would end up reading something like: A: 6:00, B: 6:30, A: 6:45, B: 7:00. But I would rather it be: A: 6:00, A: 6:45, B: 6:30, B: 7:00. That would make it easier to chart it out and make a calculation. Sorry for the long text.
Also typical server logging looks like this:
[15:46:30] [User Authenticator #1/INFO]: UUID of player DraconicPiggy is (UUID)
[15:46:31] [Server thread/INFO]: DraconicPiggy[/(Ip address)] logged in with entity id 157 at ([world]342.17291451961574, 88.0, -32.04791955684438)
The following awk script will report only on the two line types that you mentioned.
/User Authenticator|Disconnected/ {
    print
}
I'm guessing "alphabetize" means sort. If so then you can pass the awk output to sort via a pipe.
awk -f script.awk | sort
I'm assuming the file is a log that's already in date-time sequence, with the timestamps at the start of each line. In this case you'll need to tell sort what to sort on; sort /? will tell you how to do this.
Multiple input files
To process all log files in the current directory use:
awk -f script.awk *.log
Redirect output to file
The simplest way is by adding > filtered.log to the command, like this:
awk -f script.awk *.log > filtered.log
That will filter all the input files into a single output file. If you need to write one filtered log for each input file then a minor script change is needed:
/User Authenticator|Disconnected/ {
    print >> FILENAME ".filtered.log"
}
Filtering admins and redirecting to several files
The admins file should be similar to this:
Admin
DarkAdmin
PinkAdmin
The admin names must not contain spaces, e.g. DarkAdmin is OK, but Dark Admin would not work. Similarly, the user names in your log files must not contain spaces for this script to work.
Execute the following script with this command:
awk -f script.awk admins.txt *.log
Probably best to make sure the log files and the filtered output are in separate directories.
NR == FNR {
    admins[ $1 ] = NR
    next
}
/User Authenticator|Disconnected/ {
    if ( $8 in admins ) next
    print >> FILENAME ".filtered.log"
}
The above script will:
- Ignore all lines that mention an admin.
- Create a filtered version of every log file, i.e. if there are 5 log files then 5 filtered logs will be created.
Sorting the output
You have two sort keys in the file, the user and the time. This is beyond the capabilities of the standard Windows sort program, which seems very primitive. You should be able to do it with GNU sort:
sort --stable --key=8 test.log > sorted_test.log
Where:
--key=8 tells it to sort on field 8 (user)
--stable keeps the lines in date order within each user
Example of sorting a log file and displaying the result:
terry@Envy:~$ sort --stable --key=8 test.log
[15:23:30] [User Authenticator #1/INFO]: UUID of player Doris is (UUID)
[16:36:30] [User Disconnected #1/INFO]: UUID of player Doris is (UUID)
[15:46:30] [User Authenticator #1/INFO]: UUID of player DraconicPiggy is (UUID)
[16:36:30] [User Disconnected #1/INFO]: UUID of player DraconicPiggy is (UUID)
[10:24:30] [User Authenticator #1/INFO]: UUID of player Joe is (UUID)
terry@Envy:~$

What is the correct way to load prebuilt SQLite databases with or without PouchDB in a React app

Currently I'm writing a React app and struggling with simply reading from an SQLite database.
Edit because of unclear question:
***The goal is to read from the database without any backend, because it needs to read from the database even when it is offline.
***I'm aiming for a ONE TIME file conversion, then just PouchDB queries offline. But I don't want to do it manually, because there are around 6k+ records.
***Or SQL queries from the browser without any APIs, but I need to support Internet Explorer, so WebSQL is not an option. I've tried the sqlite3 library, but I can't make it work with Create React App.
The solution I tried was to use PouchDB for reading the file, but I'm coming to the conclusion that it is NOT possible to PRELOAD an SQLite file with PouchDB without using Cordova (I'm not comfortable with it, I don't want any servers running), or even with some kind of adapter.
So is this the right way of doing things?
Is there any way that I would not lose my .db data and have to convert all of it manually?
Should I forget about supporting these features on IE?
Thanks :)
Try this:
sqlite3 example "DROP TABLE IF EXISTS some_table;";
sqlite3 example "CREATE TABLE IF NOT EXISTS some_table (id INTEGER PRIMARY KEY AUTOINCREMENT, anattr VARCHAR, anotherattr VARCHAR);";
sqlite3 example "INSERT INTO some_table VALUES (NULL, '1stAttr', 'AttrA');";
sqlite3 example "INSERT INTO some_table VALUES (NULL, '2ndAttr', 'AttrB');";
## Create three JSON fragment files
sqlite3 example ".output result_prefix.json" "SELECT '{ \"docs\": ['";
sqlite3 example ".output rslt.json" "SELECT '{ \"_id\": \"someTable_' || SUBSTR(\"000000000\" || id, LENGTH(\"000000000\" || id) - 8, 9) || '\", \"anattr\": \"' || anattr || '\", \"anotherattr\": \"' || anotherattr || '\" },' FROM some_table;";
sqlite3 example ".output result_suffix.json" "SELECT '] }'";
## strip trailing comma of last record
sed -i '$ s/.$//' rslt.json;
## concatenate to a single file
cat result_prefix.json rslt.json result_suffix.json > result.json;
cat result.json;
You should be able simply to paste the above lines onto the (unix) command line, seeing output:
{ "docs": [
{ "_id": "someTable_000000001", "anattr": "1stAttr", "anotherattr": "AttrA" },
{ "_id": "someTable_000000002", "anattr": "2ndAttr", "anotherattr": "AttrB" }
] }
If you have jq installed you can do instead ...
cat result.json | jq .
... obtaining:
{
"docs": [
{
"_id": "someTable_000000001",
"anattr": "1stAttr",
"anotherattr": "AttrA"
},
{
"_id": "someTable_000000002",
"anattr": "2ndAttr",
"anotherattr": "AttrB"
}
]
}
You'll find an example of how to quickly initialize PouchDB from JSON files in part 2 of the blog post Prebuilt databases with PouchDB.
So, if you have a CouchDB server available you can do the following;
export COUCH_DB=example;
export COUCH_URL= *** specify yours here ***;
export FILE=result.json;
## Drop database
curl -X DELETE ${COUCH_URL}/${COUCH_DB};
## Create database
curl -X PUT ${COUCH_URL}/${COUCH_DB};
## Load database from JSON file
curl -H "Content-type: application/json" -X POST "${COUCH_URL}/${COUCH_DB}/_bulk_docs" -d #${FILE};
## Extract database with meta data to PouchDB initialization file
pouchdb-dump ${COUCH_URL}/${COUCH_DB} > example.json
## Inspect PouchDB initialization file
cat example.json | jq .
Obviously you'll need some adaptations, but the above should give you no problems.
Since CouchDB/PouchDB are document-oriented DBs, all records (aka docs) there are just JSON, aka JS objects. In my RN app, when I faced a similar task, I just put all the docs I wanted to be "prepopulated" in PouchDB into an array of JS objects, imported it as a module in my app, and then wrote them to PouchDB during app init as the necessary docs. That's all the prepopulation. How to export your SQL DB records to JSON is up to you; it surely depends on the source DB structure and the data logic you want to have in PouchDB.

openstack specific host vm launch

I'm running OpenStack (installed via DevStack) on a cluster of 4 compute nodes and 1 control node.
Compute hosts: node1, node2, node3, node4.
How can I run VM(s) on specific host(s), for instance on node3?
Using horizon or euca-* tools.
Thanx!
Select a specific node to boot instances on:
http://docs.openstack.org/essex/openstack-compute/admin/content/specify-host-to-boot-instances-on.html
Admin account required
Essex version
You need to use the availability zone -z option in euca-run-instances. For example, if you wanted to boot the same image on every compute host that you have:
HOSTS=`nova-manage service list | grep compute | grep -v XXX | grep -v disabled | sort | cut -f1 -d' '`
for host in $HOSTS; do
    euca-run-instances -k my-keypair -z nova:$host my-ami-id
done
This little script assumes that you only have one "availability zone" called 'nova' (the default in devstack).
Note that this still works in Essex but only if you're the admin user.
You can check your availability zone using:
openstack availability zone list
Now, to create an instance on node2 you would give:
nova boot --flavor 'm1.tiny' --image (image id) --nic net-id=(network id) --availability-zone nova:node2 instance_name
