Please, observe:
C:\> $x = #(1)
C:\> $x = #($x,2)
C:\> $x = #($x,3)
C:\> $x = #($x,4)
C:\> $x = #($x,5)
C:\> $x.Length
2
C:\> #($x |% { $_ }).Length
3
C:\> $x
Length : 2
LongLength : 2
Rank : 1
SyncRoot : {System.Object[] 2, 3}
IsReadOnly : False
IsFixedSize : True
IsSynchronized : False
Count : 2
4
5
C:\>
I expected the pipeline to flatten the list. But it does not happen. What am I doing wrong?
I expected the pipeline to flatten the list.
PowerShell enumerates arrays (lists) and, generally speaking, (most) enumerable data types, i.e. it sends their elements one by one to the success output stream (for which the pipeline acts as a conduit).
If an element itself happens to be another enumerable, it is not also enumerated.
Conceptualizing output in PowerShell as object streams of indeterminate length is better than to think of it in terms of arrays (lists) and their flattening - see the bottom section.
Separately, during for-display formatting only, enumerables are enumerated two levels deep, so that visually you can't tell the difference between #(1, 2) (an array that is enumerated, resulting in two integers being output) and Write-Output -NoEnumerate #(1, 2) (an array that is output as a single object) - even though in terms of data output these commands differ.
This for-display-only nested enumeration stops at the second level, so that any element at that level is formatted as it itself, even if it happens to be yet another enumerable; e.g., in the case of an array, that array is formatted according to the usual PowerShell rules for a given object: because the .NET Array type has more than 4 properties, Format-List formatting is implicitly applied, resulting in a line-by-line display of Array instance's properties, as shown in the following example:
# The 2nd element - a nested array - is formatted as a whole.
PS> Write-Output -NoEnumerate #(1, #(2))
1
Length : 1
LongLength : 1
Rank : 1
SyncRoot : {2}
IsReadOnly : False
IsFixedSize : True
IsSynchronized : False
Count : 1
Applying the above to your specific example:
# Your nested array.
$x = #(1); $x = #($x,2); $x = #($x,3); $x = #($x,4); $x = #($x,5)
To visualize the resulting nested array, pass it to ConvertTo-Json:
PS> ConvertTo-Json -Depth 4 $x
[
[
[
[
[
1
],
2
],
3
],
4
],
5
]
This tells us that you've created a two-element array, whose first element happens to be a nested array, itself comprising two elements, the first of which contains another nested array.
Therefore, $x.Length outputs 2.
#($x | % { $_ }).Length outputs 3 for the following reason:
Sending $x to the pipeline operator | enumerates its elements, and outputting each element (via $_ in the % (ForEach-Object) script block) causes each element to also be enumerated, if it happens to be an array.
Thus, the elements of the two-element array nested inside the first element of $x were output individually, followed by 5, the second element of $x, resulting in a total of three output objects.
Capturing these output objects in an array (#(...)) creates a three-element array whose .Length reports 3.
The for-display representation resulting from outputting $x itself follows from the explanation above.
Background information:
The PowerShell pipeline is a conduit for the success output stream, which is a stream of objects of indeterminate length, and only when that stream is captured - in the context of assigning to a variable ($var = ...) or making command output participate in an expression (e.g. (...).Foo) - does the concept of arrays enter the picture:
If the output stream happens to contain just one object, that object is captured as-is.
Otherwise, the PowerShell engine - of necessity - captures the multiple objects in a collection, which is an [object[]]-typed array.
Thus, the output stream has no concept of arrays, and it ultimately leads to confusion to discuss it in term of arrays and their flattening.
Note that the output stream (a pipeline) isn't only used when using the pipeline operator (|) to explicitly pipe data between commands, ...
... it is also used implicitly by any single command to send its output to.
However, it is not used in an expression (alone), such as 1 + 2 or [int[]] (1, 2, 3) (a command is any cmdlet, function, script, script block, or external program) or when passing arguments to commands.
That said, if you send the result of an expression to a command via | (which only works if the expression is the first pipeline segment), a pipeline is again involved, and the usual enumeration behavior (discussed below) applies to the expression's result; e.g. [int[]] (1, 2, 3) | ForEach-Object { 1 + $_ }
Perhaps surprisingly, use of the #(...) and $(...) operators invariably involves a pipeline, so that even enclosing stand-alone expressions in them results in enumeration (and re-collection); e.g., #([int[]] (1, 2, 3)).GetType().Name reports Object[] ([object[]]), because the strongly typed [int[]] array was enumerated, and the results were collected in a regular PowerShell array.
The only exceptions are array literals such as #(1, 2, 3), where (in PowerShell version 5 and above) this behavior is optimized away.
By contrast, the (...) operator does not enumerate expression results.
By default, outputting an array to the success output stream (more generally, instances of most .NET types that implement the IEnumerable interface[1]), causes it to be enumerated, i.e. its elements are sent one by one.
As such, there is no guaranteed relationship between outputting an array and whether or not capturing the output also results in an (new) array.
Notably, outputting a single object is indistinguishable from outputting that single object wrapped in a (single-element) array.
Sending arrays (collections) as a whole to the pipeline requires additional effort:
Either use Write-Output -NoEnumerate
# -> 1, because only *one* object was output, the array *as a whole*
(Write-Output -NoEnumerate #(1, 2) | Measure-Object).Count
Or - more obscurely, but more efficiently, use the unary form of ,, the array constructor operator, in order to wrap an array in an aux., transitory array whose enumeration then outputs the original array as a single object:
(, #(1, 2) | Measure-Object).Count # -> 1
[1] For a summary of which types PowerShell considers enumerable - which both excludes select types that do implement IEnumerable and includes one that doesn't - see the bottom section of this answer.
Related
I'm currently learning PowerShell, starting with the basics, and I've gotten to arrays. More specifically, looping arrays. I noticed that when declaring an array by itself, it's just simply written as
$myarray = 1, 2, 3, 4, 5
However, when an array is being declared with the intention of it being looped, it is written as
$myarray = #(1, 2, 3, 4, 5)
Out of curiosity, I tried running the code for looping through the array both with and without the # sign just to see if it would work and it was displayed within the string I created in the exact same way for both.
My question is what is the purpose of the # sign? I tried looking it up, but couldn't find any results.
It's an alternative syntax for declaring static arrays, but there are some key details to understanding the differences in syntax between them.
#() is the array sub-expression operator. This works similarly to the group-expression operator () or sub-expression operator $() but forces whatever is returned to be an array, even if only 0 or 1 element is returned. This can be used inline wherever an array or collection type is expected. For more information on these operators, read up on the Special Operators in the documentation for PowerShell.
The 1, 2, 3, 4 is the list-expression syntax, and can be used anywhere an array is expected. It is functionally equivalent to #(1, 2, 3, 4) but with some differences in behavior than the array subexpression operator.
#( Invoke-SomeCmdletOrExpression ) will force the returned value to be an array, even if the expression returns only 0 or 1 element.
# Array sub-expression
$myArray = #( Get-Process msedge )
Note this doesn't have to be a single cmdlet call, it can be any expression, utilizing the pipeline how you see fit. For example:
# We have an array of fruit
$fruit = 'apple', 'banana', 'apricot', 'cherry', 'a tomato ;)'
# Fruit starting with A
$fruitStartingWithA = #( $fruit | Where-Object { $_ -match '^a' } )
$fruitStartingWithA should return the following:
apple
apricot
a tomato ;)
There's another way to force an array type and I see it often mentioned on Stack Overflow as a cool trick (which it is), but with little context around its behavior.
You can use a quirk of the list-expression syntax to force an array type, but there are two key differences between this and using the array sub-expression operator. Prefixing an expression or variable with a comma , will force an array to be returned, but the behavior changes between the two. Consider the following example:
# Essentially the same as #( $someVar )
$myArray1 = , $someVar
# This behaves differently, read below
$myArray2 = , ( Invoke-SomeCmdletOrExpression )
#() or prefixing a variable with , will flatten (another word often used here is unroll) the resulting elements into a single array. But with an expression you have to use the group-expression operator if you use the comma-prefix trick. Due to how the grouped expression is interpreted you will end up with an array consisting of one element.
It will not flatten any resulting elements in this case.
Consider the Get-Process example above. If you have three msedge processes running, $myArray.Count will show a count of 3, and you can access the individual processes using the array-index accessor $myArray[$i]. But if you do the same with $myArray2 in the second list-expression example above, $myArray2.Count will return a count of 1. This is essentially now a multi-dimensional array with a single element. To get the individual processes, you would now need to do $myArray2[0].Count to get the process count, and use the array-index accessor twice to get an individual process:
$myArray2 = , ( Get-Process msedge )
$myArray2.Count # ======> 1
# I have 32 Edge processes right now
$myArray2[0].Count # ===> 32
# Get only the first msedge process
$myArray[0][0] # =======> Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
# =======> ------- ------ ----- ----- ------ -- -- -----------
# =======> 430 19 101216 138968 74.52 3500 1 msedge
This can be unclear at first because printing $myArray2 to the output stream will show the same output result as $myArray from the first example, and $myArray1 in the second example.
In short, you want to avoid using the comma-prefix trick when you want to use an expression, and instead use the array sub-expression #() operator as this is what it is intended for.
Note: There will be times when you want to define a static array of arrays but you will be using list-expression syntax anyways, so the comma-prefix becomes redundant. The only counterpoint here is if you want to create an array with an array in the first element to add more arrays to it later, but you should be using a generic List[T] or an ArrayList instead of relying on one of the concatenation operators to expand an existing array (+ or += are almost always bad ideas on non-numeric types).
Here is some more information about arrays in PowerShell, as well as the Arrays specification for PowerShell itself.
The operator you're looking for documentation on consists not only of the # but the ( and ) too - together they make up #(), also known as the array subexpression operator.
It ensures that the output from whatever pipeline or expression you wrap in it will be an array.
To understand why this is useful, we need to understand that PowerShell tends to flatten arrays! Let's explore this concept with a simple test function:
function Test-PowerShellArray {
param(
$Count = 2
)
while($count--){
Get-Random
}
}
This function is going to output a number of random numbers - $Count numbers, to be exact:
PS ~> Test-PowerShellArray -Count 5
652133605
1739917433
1209198865
367214514
1018847444
Let's see what type of output we get when we ask for 5 numbers:
PS ~> $numbers = Test-PowerShellArray -Count 5
PS ~> $numbers.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True Object[] System.Array
Alright, so the resulting output that we've stored in $numbers is of type [Object[]] - this means we have an array which fits objects of type Object (any type in .NET's type system ultimately inherit from Object, so it really just means we have an array "of stuff", it could contain anything).
We can try again with a different count and get the same result:
PS ~> $numbers = Test-PowerShellArray -Count 100
PS ~> $numbers.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True Object[] System.Array
So far so good - we collected multiple output values from a function and ended up with an array, all is as expected.
But what happens when we only output 1 number from the function:
PS ~> $numbers = Test-PowerShellArray -Count 1
PS ~> $numbers.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True Int32 System.ValueType
Say what? Now we're getting System.Int32 - which is the type of the individual integer values - PowerShell noticed that we only received 1 output value and went "Only 1? I'm not gonna wrap this in an array, you can have it as-is"
For this reason exactly, you might want to wrap output that you intend to loop over (or in other ways use that requires it to be an array):
PS ~> $numbers = Test-PowerShellArray -Count 1
PS ~> $numbers.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True Int32 System.ValueType
PS ~> $numbers = #(Test-PowerShellArray -Count 1) # #(...) saves the day
PS ~> $numbers.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True Object[] System.Array
I want to write two things in Powershell.
For example;
We have a one list:
$a=#('ab','bc','cd','dc')
I want to write:
1 >> ab
2 >> bc
3 >> cd
4 >> dc
I want this to be dynamic based on the length of the list.
Thanks for helping.
Use a for loop so you can keep track of the index:
for( $i = 0; $i -lt $a.Count; $i++ ){
"$($i + 1) >> $($a[$i])"
}
To explain how this works:
The for loop is defined with three sections, separated by a semi-colon ;.
The first section declares variables, in this case we define $i = 0. This will be our index reference.
The second section is the condition for the loop to continue. As long as $i is less than $a.Count, the loop will continue. We don't want to go past the length of the list or you will get undesired behavior.
The third section is what happens at the end of each iteration of the loop. In this case we want to increase our counter $i by 1 each time ($i++ is shorthand for "increment $i by 1")
There is more nuance to this notation than I've included but it has no bearing on how the loop works. You can read more here on Unary Operators.
For the loop body itself, I'll explain the string
Returning an object without assigning to a variable, such as this string, is effectively the same thing as using Write-Output.
In most cases, Write-Output is actually optional (and often is not what you want for displaying text on the screen). My answer here goes into more detail about the different Write- cmdlets, output streams, and redirection.
$() is the sub-expression operator, and is used to return expressions for use within a parent expression. In this case we return the result of $i + 1 which gets inserted into the final string.
It is unique in that it can be used directly within strings unlike the similar-but-distinct array sub-expression operator and grouping operator.
Without the subexpression operator, you would get something like 0 + 1 as it will insert the value of $i but will render the + 1 literally.
After the >> we use another sub-expression to insert the value of the $ith index of $a into the string.
While simple variable expansion would insert the .ToString() value of array $a into the final string, referencing the index of the array must be done within a sub-expression or the [] will get rendered literally.
Your solution using a foreach and doing $a.IndexOf($number) within the loop does work, but while $a.IndexOf($number) works to get the current index, .IndexOf(object) works by iterating over the array until it finds the matching object reference, then returns the index. For large arrays this will take longer and longer with each iteration. The for loop does not have this restriction.
Consider the following example with a much larger array:
# Array of numbers 1 through 65535
$a = 1..65535
# Use the for loop to output "Iteration INDEXVALUE"
# Runs in 106 ms on my system
Measure-Command { for( $i = 0; $i -lt $a.Count; $i++ ) { "Iteration $($a[$i])" } }
# Use a foreach loop to do the same but obtain the index with .IndexOf(object)
# Runs in 6720 ms on my system
Measure-Command { foreach( $i in $a ){ "Iteration $($a.IndexOf($i))" } }
Another thing to watch out for is that while you can change properties and execute methods on collection elements, you can't change the element values of a non-collection collection (any collection not in the System.Concurrent.Collections namespace) when its enumerator is in use. While invisible, foreach (and relatedly ForEach-Object) implicitly invoke the collection's .GetEnumerator() method for the loop. This won't throw an error like in other .NET languages, but IMO it should. It will appear to accept a new value for the collection but once you exit the loop the value remains unchanged.
This isn't to say the foreach loop should never be used or that you did anything "wrong", but I feel these nuances should be made known before you do find yourself in a situation where a better construct would be appropriate.
Okey,
I fixed that;
$a=#('ab','bc','cd','dc')
$a.Length
foreach ($number in $a) {
$numberofIIS = $a.IndexOf($number)
Write-Host ($numberofIIS,">>>",$number)
}
Bender's answer is great, but I personally avoid for loops if at all possible. They usually require some awkward indexing into arrays and that ugly setup... The whole thing just ends up looking like hieroglyphics.
With a foreach loop it's our job to keep track of the index (which is where this answer differs from yours) but I think in the end it is more readable then a for loop.
$a = #('ab', 'bc', 'cd', 'dc')
# Pipe the items of our array to ForEach-Object
# We use the -Begin block to initialize our index variable ($x)
$a | ForEach-Object -Begin { $x = 1 } -Process {
# Output the expression
"$x" + ' >> ' + $_
# Increment $x for next loop
$x++
}
# -----------------------------------------------------------
# You can also do this with a foreach statement
# We just have to intialize our index variable
# beforehand
$x = 1
foreach ($number in $a){
# Output the expression
"$x >> $number"
# Increment $x for next loop
$x++
}
pracl is a sysinternal command that can be used to list the ACLs of a directory. I have a list of shares and I want to create a csv file such that for each ACL entry, I want the share path in one column and share permission in the next. I was trying to do that by using the following code
$inputfile = "share.txt"
$outputFile = "out.csv"
foreach( $path in Get-Content $inputfile)
{
$results=.\pracl.exe $path
{
foreach ($result in $results) {write-host $path,$line}
}
$objResult = [pscustomobject]#{
Path = $Path
Permission = $line
}
$outputArray += $objResult
$objresult
}
$outputArray | Export-Csv -Path $outputfile -NoTypeInformation
It failed with the following error :-
Method invocation failed because [System.Management.Automation.PSObject] does not contain a method named 'op_Addition'.
At C:\Users\re07393\1\sample.ps1:14 char:1
+ $outputArray += $objResult
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (op_Addition:String) [], RuntimeException
+ FullyQualifiedErrorId : MethodNotFound
Any suggestions ?
You're trying to create an array of [pscustomobject]s in your $outputArray variable iteratively, using +=, but you're not initializing $outputArray as an array - see the bottom section for an explanation of the resulting behavior.
Thus, the immediate solution to your problem is to do just that:
# Do this before your `foreach` loop, then `+=` will work for appending elements.
$outputArray = #()
However, using += to add to arrays is inefficient, because in reality a new array instance must be created every time, because arrays are immutable data structures. That is, every time += is used, PowerShell creates a new array instance behind the scenes to which the existing elements as well as the new element are copied.
A simpler and much more efficient approach is to let PowerShell create an array for you, by using the foreach loop as an expression and assigning it to a variable as a whole:
That is, whatever is output in every iteration of the loop is automatically collected by PowerShell:
A simplified example:
# Create an array of 10 custom objects
[array] $outputArray = foreach ($i in 1..10) {
# Create and implicitly output a custom object in each iteration.
[pscustomobject] #{
Number = $i
}
}
Note the use of type constraint [array] to the left of $outputArray, which ensures that the variable value is always an array, even if the loop happens to produce just one output object (in which case PowerShell would otherwise just store that object itself, and not wrap it in an array).
Note that you can similarly use for, if, do / while / switch statements as expressions.
In all cases, however, these statements can only serve as expressions by themselves; regrettably, using them as the first segment of a pipeline or embedding them in larger expressions does not work - see GitHub issue #6817.
As for what you tried:
$outputArray += $objResult
Since you didn't initialize $outputArray before the loop, the variable is implicitly created in the loop's first iteration:
If the LHS variable doesn't exist yet, += is effectively the same as =: that is, the RHS is stored as-is in the LHS variable, so that $outputArray now contains a [pscustomobject] instance.
In the second iteration, because $outputArray now has a value, += now tries to perform a type-appropriate + operation (such as numeric addition for numbers, and concatenation for strings), but no + (op_Addition()) operation is defined for type [pscustomobject], so the operation fails with the error message you saw.
Trying to make a script that request more info (group Id) if there are SCOM groups with identical names:
function myFunction {
[CmdletBinding()]
Param(
[Parameter(Mandatory=$true)]
[string[]]$ObjectName
)
foreach ($o in $ObjectName) {
$p = Get-SCOMGroup -DisplayName "$o" | select DisplayName
<#
if ($p contains more than one string) {
"Request group Id"
} else {
"do this"
}
#>
}
}
Need help with the functionality in the comment block.
Wrap the value in an array subexpression #() and count how many entries it has:
if(#($p).Count -gt 1){"Request group Id"}
Note: This answer complements Mathias R. Jessen's helpful answer.
Counting the number of objects returned by a command:
Mathias' answer shows a robust, PowerShell v2-compatible solution based on the array sub-expression operator, #().
# #() ensures that the output of command ... is treated as an array,
# even if the command emits only *one* object.
# You can safely call .Count (or .Length) on the result to get the count.
#(...).Count
In PowerShell v3 or higher, you can treat scalars like collections, so that using just (...).Count is typically enough. (A scalar is a single objects, as opposed to a collections of objects).
# Even if command ... returns only *one* object, it is safe
# to call .Count on the result in PSv3+
(...).Count
These methods are typically, but not always interchangeable, as discussed below.
Choose #(...).Count, if:
you must remain PSv2-compatible
you want to count output from multiple commands (separated with ; or newlines)
for commands that output entire collections as a single object (which is rare), you want to count such collections as 1 object.[1]
more generally, if you need to ensure that the command output is returned as a bona fide array, though note that it is invariably of type [object[]]; if you need a specific element type, use a cast (e.g., [int[]]), but note that you then don't strictly need the #(...); e.g.,
[int[]] (...) will do - unless you want to prevent enumeration of collections output as single objects.
Choose (...).Count, if:
only one command's output must be counted
for commands that output entire collections as a single object, you want to count the individual elements of such collections; that is, (...) forces enumeration of command output.[2]
for counting the elements of commands's output already stored in a variable - though, of course, you can then simply omit the (...) and use $var.Count
Caveat: Due to a longstanding bug (still present as of PowerShell Core 6.2.0), accessing .Count on a scalar fails while Set-StrictMode -Version 2 or higher is in effect - use #(...) in that case, but note that you may have to force enumeration.
To demonstrate the difference in behavior with respect to (rare) commands that output collections as single objects:
PS> #(Write-Output -NoEnumerate (1..10)).Count
1 # Array-as-single-object was counted as *1* object
PS> (Write-Output -NoEnumerate (1..10)).Count
10 # Elements were enumerated.
Performance considerations:
If a command's output is directly counted, (...) and #(...) perform about the same:
$arr = 1..1e6 # Create an array of 1 million integers.
{ (Write-Output $arr).Count }, { #(Write-Output $arr).Count } | ForEach-Object {
[pscustomobject] #{
Command = "$_".Trim()
Seconds = '{0:N3}' -f (Measure-Command $_).TotalSeconds
}
}
Sample output, from a single-core Windows 10 VM (the absolute timings aren't important, only that the numbers are virtually the same):
Command Seconds
------- -------
(Write-Output $arr).Count 0.352
#(Write-Output $arr).Count 0.365
By contrast, for large collections already stored in a variable, #(...) introduces substantial overhead, because the collection is recreated as a (new) array (as noted, you can just $arr.Count):
$arr = 1..1e6 # Create an array of 1 million integers.
{ ($arr).Count }, { #($arr).Count } | ForEach-Object {
[pscustomobject] #{
Command = "$_".Trim()
Seconds = '{0:N3}' -f (Measure-Command $_).TotalSeconds
}
}
Sample output; note how the #(...) solution is about 7 times slower:
Command Seconds
------- -------
($arr).Count 0.009
#($arr).Count 0.067
Coding-style considerations:
The following applies in situations where #(...) and (...) are functionally equivalent (and either perform the same or when performance is secondary), i.e., when you're free to choose which construct to use.
Mathias recommends #(...).Count, stating in a comment:
There's another reason to explicitly wrap it in this context - conveying intent, i.e., "We don't know if $p is a scalar or not, hence this construct".
My vote is for (...).Count:
Once you understand that PowerShell (v3 or higher) treats scalars as collections with count 1 on demand, you're free to take advantage of that knowledge without needing to reflect the distinction between a scalar and an array in the syntax:
When writing code, this means you needn't worry about whether a given command situationally may return a scalar rather than a collection (which is common in PowerShell, where capturing output from a command with a single output object captures that object as-is, whereas 2 or more output objects result in an array).
As a beneficial side effect, the code becomes more concise (and sometimes faster).
Example:
# Call Get-ChildItem twice, and, via Select-Object, limit the
# number of output objects to 1 and 2, respectively.
1..2 | ForEach-Object {
# * In the 1st iteration, $var becomes a *scalar* of type [System.IO.DirectoryInfo]
# * In the 2nd iteration, $var becomes an *array* with
# 2 elements of type [System.IO.DirectoryInfo]
$var = Get-ChildItem -Directory / | Select-Object -First $_
# Treat $var as a collection, which in PSv3+ works even
# if $var is a scalar:
[pscustomobject] #{
Count = $var.Count
FirstElement = $var[0]
DataType = $var.GetType().Name
}
}
The above yields:
Count FirstElement DataType
----- ------------ --------
1 /Applications DirectoryInfo
2 /Applications Object[]
That is, even the scalar object of type System.IO.DirectoryInfo reported its .Count sensibly as 1 and allowed access to "its first element" with [0].
For more about the unified handling of scalars and collections, see this answer.
[1] E.g., #(Write-Output -NoEnumerate 1, 2).Count is 1, because the Write-Output command outputs a single object - the array 1, 2 - _as a whole. Because only a single object is output, #(...) wraps that object in an array, resulting in , (1, 2), i.e. a single-element array whose first and only element is itself an array.
[2] E.g., (Write-Output -NoEnumerate 1, 2).Count is 2, because even though the Write-Output command outputs the array as a single object, that array is used as-is. That is, the whole expression is equivalent to (1, 2).Count. More generally, if a command inside (...) outputs just one object, that object is used as-is; if it outputs multiple objects, they are collected in a regular PowerShell array (of type [object[]]) - this is the same behavior you get when capturing command output via a variable assignment ($captured = ...).
I have a very large JSON file containing an array. Is it possible to use jq to split this array into several smaller arrays of a fixed size? Suppose my input was this: [1,2,3,4,5,6,7,8,9,10], and I wanted to split it into 3 element long chunks. The desired output from jq would be:
[1,2,3]
[4,5,6]
[7,8,9]
[10]
In reality, my input array has nearly three million elements, all UUIDs.
There is an (undocumented) builtin, _nwise, that meets the functional requirements:
$ jq -nc '[1,2,3,4,5,6,7,8,9,10] | _nwise(3)'
[1,2,3]
[4,5,6]
[7,8,9]
[10]
Also:
$ jq -nc '_nwise([1,2,3,4,5,6,7,8,9,10];3)'
[1,2,3]
[4,5,6]
[7,8,9]
[10]
Incidentally, _nwise can be used for both arrays and strings.
(I believe it's undocumented because there was some doubt about an appropriate name.)
TCO-version
Unfortunately, the builtin version is carelessly defined, and will not perform well for large arrays. Here is an optimized version (it should be about as efficient as a non-recursive version):
def nwise($n):
def _nwise:
if length <= $n then . else .[0:$n] , (.[$n:]|_nwise) end;
_nwise;
For an array of size 3 million, this is quite performant:
3.91s on an old Mac, 162746368 max resident size.
Notice that this version (using tail-call optimized recursion) is actually faster than the version of nwise/2 using foreach shown elsewhere on this page.
The following stream-oriented definition of window/3, due to Cédric Connes
(github:connesc), generalizes _nwise,
and illustrates a
"boxing technique" that circumvents the need to use an
end-of-stream marker, and can therefore be used
if the stream contains the non-JSON value nan. A definition
of _nwise/1 in terms of window/3 is also included.
The first argument of window/3 is interpreted as a stream. $size is the window size and $step specifies the number of values to be skipped. For example,
window(1,2,3; 2; 1)
yields:
[1,2]
[2,3]
window/3 and _nsize/1
def window(values; $size; $step):
def checkparam(name; value): if (value | isnormal) and value > 0 and (value | floor) == value then . else error("window \(name) must be a positive integer") end;
checkparam("size"; $size)
| checkparam("step"; $step)
# We need to detect the end of the loop in order to produce the terminal partial group (if any).
# For that purpose, we introduce an artificial null sentinel, and wrap the input values into singleton arrays in order to distinguish them.
| foreach ((values | [.]), null) as $item (
{index: -1, items: [], ready: false};
(.index + 1) as $index
# Extract items that must be reused from the previous iteration
| if (.ready | not) then .items
elif $step >= $size or $item == null then []
else .items[-($size - $step):]
end
# Append the current item unless it must be skipped
| if ($index % $step) < $size then . + $item
else .
end
| {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
if .ready then .items else empty end
);
def _nwise($n): window(.[]; $n; $n);
Source:
https://gist.github.com/connesc/d6b87cbacae13d4fd58763724049da58
If the array is too large to fit comfortably in memory, then I'd adopt the strategy suggested by #CharlesDuffy -- that is, stream the array elements into a second invocation of jq using a stream-oriented version of nwise, such as:
def nwise(stream; $n):
foreach (stream, nan) as $x ([];
if length == $n then [$x] else . + [$x] end;
if (.[-1] | isnan) and length>1 then .[:-1]
elif length == $n then .
else empty
end);
The "driver" for the above would be:
nwise(inputs; 3)
But please remember to use the -n command-line option.
To create the stream from an arbitrary array:
$ jq -cn --stream '
fromstream( inputs | (.[0] |= .[1:])
| select(. != [[]]) )' huge.json
So the shell pipeline might look like this:
$ jq -cn --stream '
fromstream( inputs | (.[0] |= .[1:])
| select(. != [[]]) )' huge.json |
jq -n -f nwise.jq
This approach is quite performant. For grouping a stream of 3 million items into groups of 3 using nwise/2,
/usr/bin/time -lp
for the second invocation of jq gives:
user 5.63
sys 0.04
1261568 maximum resident set size
Caveat: this definition uses nan as an end-of-stream marker. Since nan is not a JSON value, this cannot be a problem for handling JSON streams.
here's a simple one that worked for me:
def chunk(n):
range(length/n|ceil) as $i | .[n*$i:n*$i+n];
example usage:
jq -n \
'def chunk(n): range(length/n|ceil) as $i | .[n*$i:n*$i+n];
[range(5)] | chunk(2)'
[
0,
1
]
[
2,
3
]
[
4
]
bonus: it doesn't use recursion and doesn't rely on _nwise, so it also works with jaq.
The below is hackery, to be sure -- but memory-efficient hackery, even with an arbitrarily long list:
jq -c --stream 'select(length==2)|.[1]' <huge.json \
| jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'
The first piece of the pipeline streams in your input JSON file, emitting one line per element, assuming the array consists of atomic values (where [] and {} are here included as atomic values). Because it runs in streaming mode it doesn't need to store the entire content in memory, despite being a single document.
The second piece of the pipeline repeatedly reads up to three items and assembles them into a list.
This should avoid needing more than three pieces of data in memory at a time.