Can a shake rule determine which "needs" have changed since the last build? - shake-build-system

I am building a shake based build system for a large Ruby (+ other things) code base, but I am struggling to deal with Ruby commands that expect to be passed a list of files to "build".
Take Rubocop (a linting tool). I can see three options:
need all Ruby files individually; if they change, run rubocop against the individual file that changed for each file that changed (very slow on first build or if many ruby files change because rubocop has a large start up time)
need all Ruby files; if any change, run rubocop against all the ruby files (very slow if only one or two files have changed because rubocop is slow to work out if a file has changed or not)
need all Ruby files; if any change, pass rubocop the list of changed dependencies as detected by Shake
The first two rules are trivial to build in shake, but my problem is I cannot work out how to represent this last case as a shake rule. Can anyone help?

There are two approaches to take with Shake, using batch or needHasChanged. For your situation I'm assuming rubocop just errors out if there are lint violations, so a standard one-at-a-time rule would be:
"*.rb-lint" %> \out -> do
    need [out -<.> "rb"]
    cmd_ "rubocop" (out -<.> "rb")
    writeFile' out ""
Use batch
The function batch describes itself as:
Useful when a command has a high startup cost - e.g. apt-get install foo bar baz is a lot cheaper than three separate calls to apt-get install.
And the code would be roughly:
batch 3 ("*.rb-lint-errors" %>)
    (\out -> do need [out -<.> "rb"]; return out) $
    (\outs -> do cmd_ "rubocop" [out -<.> "rb" | out <- outs]
                 mapM_ (flip writeFile' "") outs)
Use needHasChanged
The function needHasChanged describes itself as:
Like need but returns a list of rebuilt dependencies since the calling rule last built successfully.
So you would write:
"stamp.lint" %> \out -> do
    changed <- needHasChanged listOfAllRubyFiles
    cmd_ "rubocop" changed
    writeFile' out ""
The advantage of batch is that it is able to run multiple batches in parallel, and you can set a cap on how much to batch. In contrast needHasChanged is simpler but is very operational. For many problems, both are reasonable solutions. Both these functions are relatively recent additions to Shake, so make sure you are using 0.17.2 or later, to ensure it has all the necessary bug fixes.
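To make "changed since the calling rule last built successfully" concrete, here is a minimal Python sketch of the idea behind needHasChanged. It is not how Shake implements it: the JSON state file, the content hashing, and the names changed_since_last_run, record_success and lint_changed are all assumptions made for illustration.

```python
import hashlib
import json


def file_digest(path):
    """Return a hex digest of the file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def changed_since_last_run(paths, state_path):
    """Return the files whose content differs from the recorded state.

    On the first run there is no state, so every file counts as changed.
    """
    try:
        with open(state_path) as handle:
            recorded = json.load(handle)
    except FileNotFoundError:
        recorded = {}
    return [p for p in paths if recorded.get(p) != file_digest(p)]


def record_success(paths, state_path):
    """Store the current digests; call this only after the tool succeeded."""
    with open(state_path, "w") as handle:
        json.dump({p: file_digest(p) for p in paths}, handle)


def lint_changed(paths, state_path, run_tool):
    """Run the tool once on the changed files, then record the new state.

    run_tool is a hypothetical callback that is expected to raise on
    failure, so a failed run leaves the old state in place.
    """
    changed = changed_since_last_run(paths, state_path)
    if changed:
        run_tool(changed)
    record_success(paths, state_path)
    return changed
```

Because the state is only recorded after run_tool returns, a failed lint run reports the same files as changed next time, which mirrors the "last built successfully" wording.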


need "ensure dependency is up to date"

I was watching Neil discussing Shake at ICFP. He mentions in the talk that the need function ensures that the dependency is "up to date". What does this mean exactly? Below is the code used in the talk:
"Foo.o" *> \_ -> do
    need ["Foo.c"]
    system' "gcc" ["-c", "Foo.c"]
Does this mean that the Shake framework expects there to be a "rule" on how to build "Foo.c", and will run that rule when figuring out if it needs to re-run the rule for building "Foo.o"? If that is the case, does Shake in essence have a map from File to Rule? What happens when my dependency is a file that simply exists on my system? If Shake is not used to generate it, and I use need ["Somefile.txt"], no rule will exist for how to build "Somefile.txt". Will Shake crash? At the root of it all, we have to start from some files that already exist.
P.S. I am new to build systems and to Shake; any guidance is appreciated.
A dependency is "up to date" if all its dependencies are up to date, and it has been run with those dependencies in their current value. But the important point in this question seems to be that Foo.o in Shake can refer to two things:
There can be a rule "Foo.o" *> which runs some commands, probably depending on source files, and produces an output file Foo.o.
If there are no rules to produce Foo.o, then Shake assumes Foo.o is a source file. At the leaves there must be files that are source files.
You can see this in the error message Shake produces:
$ shake shakeOptions $ action $ need ["hello.txt"]
Error, file does not exist and no rule available:
The fact that rules are named after the file they produce, and that the absence of a rule implies a source file, is shared with build systems like Make. However, this property differs from build systems like Buck/Bazel, where targets and sources have distinct namespaces.
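The "rule if one matches, otherwise a source file that must exist" lookup described above can be sketched in a few lines of Python. This is only a toy model, not Shake's implementation: it does no change tracking (every matching rule is re-run), and the names need, rules and MissingRuleError are assumptions made for illustration.

```python
import fnmatch
import os


class MissingRuleError(Exception):
    """Raised when a target has no rule and does not exist on disk."""


def need(target, rules, built=None):
    """Bring `target` up to date within one build run.

    `rules` is a list of (pattern, action) pairs, where action(target, need)
    builds the target and may call need(dep) for its own dependencies.
    A target with no matching rule is treated as a source file and must
    already exist.
    """
    built = set() if built is None else built
    if target in built:
        return
    for pattern, action in rules:
        if fnmatch.fnmatch(target, pattern):
            action(target, lambda dep: need(dep, rules, built))
            built.add(target)
            return
    if not os.path.exists(target):
        raise MissingRuleError(
            f"file does not exist and no rule available: {target}")
    built.add(target)
```

With a single "*.o" rule, needing Foo.o runs the rule, which in turn needs Foo.c; Foo.c matches no rule, so it only has to exist.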

Common Lisp Package Definition with Dependencies for Exploration at the REPL?

This is another take at the question of packages and systems, especially for local use in personal projects (maybe similar to HOWTO definition and usage of Common Lisp packages (libraries)? from 2014). What are recommendations for how to approach this and can links be provided for sources that describe the recommended approach in 2022? CL's long half-life means lots of old blogs and books are still useful, but unfortunately not all of them, and it is hard for the novice to know which are the still viable guides.
Here is the start of a file that I might write for a simple personal exploration project. My basic working mode would be to start slime, load and compile this in emacs, and then change to the slime repl and use in-package to move into my package and start playing with my functions.
(ql:quickload "alexandria")
(ql:quickload "trivia")
(defpackage computing-with-cognitive-states
  (:use cl)
  (:nicknames :cwcs)
  (:local-nicknames (:alex :alexandria)))
(in-package :cwcs)
(defun sum-denom (dimension dist) ... ; and more defun's
The leading ql:quickload's strike me as ugly, but without them I can't load and compile. Is it recommended that I do this manually in slime repl first? Or is this considered an acceptable practice for small use cases like this? Or is there another recommended way?
If I am thinking of using my package elsewhere someday is it recommended that I make it a system, perhaps with quickproject? Or is it the more traditional approach to just load a fasl file from one directory on my system when another session (or another package definition) needs it?
In summary, the questions are about the writing of code that is for personal use, might be a small file, or might be a few files that depend on each other and a few external systems. What are some of the reasonable approaches and resources that describe those approaches? And which are the ones that would be easiest to scale up if the project grew into something that one wanted to share?
For personal one-off »scripts«, I have a little macro in my .sbclrc:
(in-package #:cl-user)
(defmacro ql-require (&rest system-names)
  `(eval-when (:compile-toplevel :load-toplevel :execute)
     (ql:quickload ',system-names)))
Then my script file looks like this:
(in-package cl-user) ; naked symbols, I don't care much about package pollution here
(ql-require "alexandria" "arrows" …)
(defpackage my-script
  (:use cl alexandria arrows))
(in-package my-script)
;; script away!
I open these in Emacs/SLIME, C-c C-k. This often just has the »main« at toplevel, so it will just print at the REPL. In other cases, I hit C-c ~ afterwards to work from the REPL.
As soon as it's not just a script, but something I might want to quickload entirely, I create a simple .asd file for it, which just moves the systems I depend on from the ql-require in the file to the :depends-on in the system definition. The package definition can remain in the one .lisp file. All my Common Lisp projects are under ~/common-lisp/, which is in the ASDF load path by default. A very simple .asd file can look like this:
(in-package #:asdf-user) ; uninterned symbols as string designators avoid package pollution
(defsystem "my-system" ; system names are lower-case strings
  :depends-on ("alexandria" "arrows" …)
  :serial t
  :components ((:file "my-file")))
The filename for the system definition should be the same as the system name (my-system.asd here). If you put multiple systems into a single file, their names should share the filename as a prefix, optionally followed by a slash and some qualifying suffix (e. g. "my-system" and "my-system/test"). This ensures that ASDF can quickly find the right file to load without having to load it first.
As soon as I split the functionality into several files, I will usually also put the defpackage into its own file.
As soon as I create multiple packages, I will usually create subdirectories for each of them (:modules in :components in the system definition). (ASDF has an option to make a so-called package inferred system, but I prefer the central system definition.)
If you want to share a system, you'd probably want to add an :author and a :license to the system definition (even if it's just CC0/Public Domain). If you think that it's of general interest, you can submit it to Quicklisp.
The simple solution would indeed be to define a system, using ASDF. This is really all that quickproject does, but it is (IMO) a perfectly reasonable solution even for smaller projects.
What you would typically do, assuming you place your code in a somewhat standard location on your system (i.e. you do not have Lisp projects scattered all over your file system!), is create a symlink from ~/quicklisp/local-projects/ to your/personal/projects/directory. This way, after creating a project with quickproject:make-project (or writing the .asd file by hand ...) in this directory, Quicklisp will be able to find it and you can then quickload it directly, along with its dependencies if you have specified those in the .asd file.
To make things clearer: Quicklisp is "simply" a tool built on top of ASDF which is able to download (and then load) dependencies, and their dependencies ... and so on, if they are specified in a .asd file located in a place it knows. An "equivalent" could be Python's pip. On the other hand, ASDF specifies a way to define systems, that is, a set of source files that should be compiled together, in some specific order, using some dependencies, and so on. On a lower level, this would be more like C/C++'s make or CMake.

Automatically find dependencies and create CMakeLists.txt with CMake (or CMake Tools in Visual Studio Code) [duplicate]

CMake offers several ways to specify the source files for a target.
One is to use globbing (documentation), for example:
Another method is to specify each file individually.
Which way is preferred? Globbing seems easy, but I heard it has some downsides.
Full disclosure: I originally preferred the globbing approach for its simplicity, but over the years I have come to recognise that explicitly listing the files is less error-prone for large, multi-developer projects.
Original answer:
The advantages to globbing are:
It's easy to add new files as they are only listed in one place: on disk. Not globbing creates duplication.
Your CMakeLists.txt file will be shorter. This is a big plus if you have lots of files. Not globbing causes you to lose the CMake logic amongst huge lists of files.
The advantages of using hardcoded file lists are:
CMake will track the dependencies of a new file on disk correctly - if we use glob then files not globbed first time round when you ran CMake will not get picked up.
You ensure that only files you want are added. Globbing may pick up stray files that you do not want.
In order to work around the first issue, you can simply "touch" the CMakeLists.txt that does the glob, either by using the touch command or by writing the file with no changes. This will force CMake to re-run and pick up the new file.
To fix the second problem you can organize your code carefully into directories, which is what you probably do anyway. In the worst case, you can use the list(REMOVE_ITEM) command to clean up the globbed list of files:
file(GLOB to_remove file_to_remove.cpp)
list(REMOVE_ITEM list ${to_remove})
The only real situation where this can bite you is if you are using something like git-bisect to try older versions of your code in the same build directory. In that case, you may have to clean and compile more than necessary to ensure you get the right files in the list. This is such a corner case, and one where you already are on your toes, that it isn't really an issue.
The best way to specify source files in CMake is by listing them explicitly.
The creators of CMake themselves advise not to use globbing.
(We do not recommend using GLOB to collect a list of source files from your source tree. If no CMakeLists.txt file changes when a source is added or removed then the generated build system cannot know when to ask CMake to regenerate.)
Of course, you might want to know what the downsides are - read on!
When Globbing Fails:
The big disadvantage to globbing is that creating/deleting files won't automatically update the build-system.
If you are the person adding the files, this may seem an acceptable trade-off. However, it causes problems for other people building your code: they update the project from version control, run the build, then contact you, complaining that "the build's broken".
To make matters worse, the failure typically shows up as a linking error that gives no hint of the cause, and time is lost troubleshooting it.
In a project I worked on we started off globbing, but got so many complaints when new files were added that it was reason enough to list files explicitly instead.
This also breaks common git workflows (git bisect and switching between feature branches).
So I can't recommend this: the problems it causes far outweigh the convenience. When someone can't build your software because of this, they may lose a lot of time tracking down the issue, or just give up.
One more note: just remembering to touch CMakeLists.txt isn't always enough. With automated builds that use globbing, I had to run cmake before every build, since files might have been added or removed since the last build *.
Exceptions to the rule:
There are times where globbing is preferable:
For setting up CMakeLists.txt files for existing projects that don't use CMake. It's a fast way to get all the sources referenced (once the build system is running, replace globbing with explicit file lists).
When CMake isn't used as the primary build system, for example if you're using a project that doesn't use CMake and you would like to maintain your own build system for it.
For any situation where the file list changes so often that it becomes impractical to maintain. In this case it could be useful, but then you have to accept running cmake to generate build files every time to get a reliable/correct build (which goes against the intention of CMake - the ability to split configuration from building).
* Yes, I could have written code to compare the tree of files on disk before and after an update, but this is not such a nice workaround and something better left up to the build system.
In CMake 3.12, the file(GLOB ...) and file(GLOB_RECURSE ...) commands gained a CONFIGURE_DEPENDS option which reruns cmake if the glob's value changes.
As that was the primary disadvantage of globbing for source files, it is now okay to do so:
# Whenever this glob's value changes, cmake will rerun and update the build with the
# new/removed files.
file(GLOB sources CONFIGURE_DEPENDS "*.cpp")
add_executable(my_target ${sources})
However, some people still recommend avoiding globbing for sources. Indeed, the documentation states:
We do not recommend using GLOB to collect a list of source files from your source tree. ... The CONFIGURE_DEPENDS flag may not work reliably on all generators, or if a new generator is added in the future that cannot support it, projects using it will be stuck. Even if CONFIGURE_DEPENDS works reliably, there is still a cost to perform the check on every rebuild.
Personally, I consider the benefits of not having to manually manage the source file list to outweigh the possible drawbacks. If you do have to switch back to manually listed files, this can be easily achieved by just printing the globbed source list and pasting it back in.
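The trade-off quoted from the documentation is easier to see with a toy model of what a CONFIGURE_DEPENDS-style check has to do: re-evaluate the glob before every build and reconfigure only when the result differs from the recorded one. The Python sketch below is an illustration of that idea, not CMake's implementation; configure, build and the "*.cpp" pattern are assumed names and values.

```python
import glob
import os


def glob_sources(directory, pattern="*.cpp"):
    """Return the sorted list of files matching the pattern."""
    return sorted(glob.glob(os.path.join(directory, pattern)))


def configure(directory):
    """'Configure' step: evaluate the glob once and record the result."""
    return {"sources": glob_sources(directory)}


def build(directory, config):
    """'Build' step: re-check the glob first, reconfigure if it changed.

    Returns the (possibly refreshed) config and whether a reconfigure ran.
    """
    if glob_sources(directory) != config["sources"]:
        return configure(directory), True
    return config, False
```

The glob is re-evaluated on every call to build, which is the per-rebuild cost the documentation warns about; without that check, the recorded list simply goes stale when a file is added.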
You can safely glob (and probably should) at the cost of an additional file to hold the dependencies.
Add functions like these somewhere:
# Compare the new contents with the existing file, if it exists and is the
# same we don't want to trigger a make by changing its timestamp.
function(update_file path content)
    set(old_content "")
    if(EXISTS "${path}")
        file(READ "${path}" old_content)
    endif()
    if(NOT old_content STREQUAL content)
        file(WRITE "${path}" "${content}")
    endif()
endfunction()
# Creates a file called CMakeDeps.cmake next to your CMakeLists.txt with
# the list of dependencies in it - this file should be treated as part of
# CMakeLists.txt (source controlled, etc.).
function(update_deps_file deps)
    set(deps_file "CMakeDeps.cmake")
    # Normalize the list so it's the same on every machine
    foreach(dep IN LISTS deps)
        file(RELATIVE_PATH rel_dep ${CMAKE_CURRENT_SOURCE_DIR} ${dep})
        list(APPEND rel_deps ${rel_dep})
    endforeach()
    list(SORT rel_deps)
    # Update the deps file
    set(content "# generated by make process\nset(sources ${rel_deps})\n")
    update_file(${deps_file} "${content}")
    # Include the file so it's tracked as a generation dependency we don't
    # need the content.
    include(${deps_file})
endfunction()
And then go globbing:
file(GLOB_RECURSE sources LIST_DIRECTORIES false *.h *.cpp)
update_deps_file("${sources}")
add_executable(test ${sources})
You're still carting around the explicit dependencies (and triggering all the automated builds!) like before, only it's in two files instead of one.
The only change in procedure is after you've created a new file. If you don't glob, the workflow is to modify CMakeLists.txt from inside Visual Studio and rebuild; if you do glob, you run cmake explicitly, or just touch CMakeLists.txt.
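The core trick in this approach is "write the generated list only if its content changed", so the file's timestamp stays put when nothing was added or removed. Here is a minimal Python sketch of that idea, assuming a hypothetical render_source_list helper and the .h/.cpp extensions from the glob above; it is an illustration, not a drop-in replacement for the CMake functions.

```python
import os


def update_file(path, content):
    """Write `content` to `path` only if it differs from what is there.

    Returns True if the file was written, False if it was left untouched
    (so its modification time does not change).
    """
    try:
        with open(path, encoding="utf-8") as handle:
            if handle.read() == content:
                return False
    except FileNotFoundError:
        pass
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return True


def render_source_list(root, extensions=(".h", ".cpp")):
    """Render a sorted, relative source list as a CMake `set(...)` block."""
    found = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(extensions):
                rel = os.path.relpath(os.path.join(folder, name), root)
                found.append(rel.replace(os.sep, "/"))
    return "set(sources\n" + "".join(f"  {p}\n" for p in sorted(found)) + ")\n"
```

Sorting and using relative, forward-slash paths keeps the generated file identical across machines, which matters if it is kept under source control.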
Specify each file individually!
I use a conventional CMakeLists.txt and a python script to update it. I run the python script manually after adding files.
See my answer here:
I'm not a fan of globbing and have never used it for my libraries. But recently I watched a presentation by Robert Schumacher (vcpkg developer) where he recommends treating all your library sources as separate components (for example, private sources (.cpp), public headers (.h), tests and examples are all separate components) and using separate folders for each of them (similarly to how we use C++ namespaces for classes). In that case I think globbing makes sense, because it lets you express this component approach clearly and encourages other developers to follow it. For example, your library directory structure can be the following:
/include - for public headers
/src - for private headers and sources
/tests - for tests
You obviously want other developers to follow your convention (i.e., place public headers under /include and tests under /tests). file(GLOB) gives developers a hint that all files in a directory have the same conceptual meaning, and that any file placed in this directory and matching the pattern will be treated the same way (for example, installed during 'make install' if we are talking about public headers).

Is it possible for Shake to change a source file?

When running tools such as formatters and linting tools with "auto-correction" options, it can be that the input and output for a Rule are the same file; for example:
"//*.hs" %> \out ->
  cmd_ "ormolu" "-m" "inplace" out
-- OR
batch 10 ("//*.hs" %>)
  ( \out -> do
      cmd_ "ormolu" "-m" "inplace" out
      pure out
  )
  (cmd_ "hlint")
This seems to work "correctly" (the rule is re-run if the source file is needed and has changed), but we're unsure if this is a happy coincidence or shake working as designed - especially when we start thinking about cached results from shakeShare or in the future Cloud Shake. Is this the best way to handle this type of rule, or is there something better?
There is no principled way to write a rule that replaces a source file in Shake. Given a source code formatter, anything else isn't very useful. Shake makes the assumption that inputs don't change while the compilation is ongoing. It's likely that passing --lint will lead to a lint error and that it would be incompatible with Cloud Shake. The official advice would be to make such changes in a separate non-Shake pass before you call shake.
However, if it works for you, and is useful, I wouldn't overly worry. The pattern has tests in Shake, it's something plenty of people do. You can turn off Cloud caching on a per file basis with historyDisable.

Distributing loadable builtin bash modules

I've written a built-in for bash which modifies the 'cd' command, a requirement for my software. Is there a way to actually distribute a loadable independently of bash itself? I'd ideally like to distribute just a drop in "additional feature" because I know people can be put off by patching and compiling their shell from source code.
I want to time how long a user is in a directory so I can determine where they want to be. It's this functionality: rewritten as a bash builtin, for performance issues. This implementation uses $PROMPT_COMMAND to work but I wanted something integrated.
It is unclear what you have modified but in any case, bash (like at least ksh93 which IIRC introduced the concept and zsh) supports, using the enable -f file name syntax, loading built-in functions as external dynamically loaded modules.
These modules being plain files can certainly be distributed independently, as long as you make sure they are compatible with the target version/architecture. This was already true 5 years ago when you asked this question.
One issue in your case is that there seems to be no documented way to override an internal built-in like cd with a dynamically loaded one while keeping the ability to access the former.
A simple workaround would be to implement your customized cd with a different name, say mycd, like this:
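int mycd_builtin(list)
WORD_LIST *list;
{
  int rv;
  rv = cd_builtin(list); /* delegate to the regular cd first */
  if (rv == EXECUTION_SUCCESS) {
    char wd[PATH_MAX+1];
    getcwd(wd, sizeof(wd));
    // do your custom stuff knowing the new working directory
  }
  return (rv);
}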
then to use an alias, or better, a shell function for your customized version to be used instead of the regular one:
cd() {
  mycd "$@"
}
Changing the built-in cd is a support nightmare for any admin and unwelcome to foreign users. What is wrong with naming it 'smart-cd' and letting the USER decide if they want the functionality by including it in their .bashrc or .profile? Then they can setup things however they want.
Also, using how long you've been in a directory is a pretty poor indication of preference. How would you distinguish between idling (a forgotten shell hanging in /tmp overnight), long-running scripts (nightly cron jobs), and actual activity.
There are a multitude of other methods for creating shortcuts to favorite directories: aliases, softlinks, $VARIABLES, scripts. It is arrogant of you to assume that your usage patterns will be welcomed by other users of your system.
