When the Manager was shutting down while the sleep scheduler was running, it
could cause a null pointer dereference. This is now doubly solved:
- `worker.Identifier()` is now nil-safe, as in, `worker` can be `nil` and
it will still return a sensible string.
- failure to apply the sleep schedule due to the context closing is not
logged as error any more.
Blender not being found can be reported via various errors (this should be
reworked in the 'blender finder API' at some point). `exec.ErrNotFound` is
returned when Blender cannot be found on `$PATH`, which is something that's
absolutely fine. This is now logged less dramatically.
Avoid the word "error" in logging when Blender cannot be found. Typically
these are warnings, and having the word "error" there makes people think
otherwise.
Fix T100757 by reducing the log level to "info" when Blender writes output
to a file format the Worker cannot handle. Such cases are expected, and
now no longer result in an error message.
When doing two-way variable replacement, if the variable has a Windows
path (i.e. backslashes) also do a match for the value with forward slashes.
In other words, if a path `Y:/shared/...` comes in, and the variable value
is (correctly) `Y:\shared\...`, it will be seen as a match.
When a submitted job is refused because of a mismatched etag, there is
now a more explanatory error logged on the Manager. The website also has
an entry in the FAQ for this, as I expect more people to run into this
issue when they upgrade Flamenco.
The `require.XXX` functions are exactly the same as `assert.XXX`
functions + directly failing the test, so this refactor simplifies the
code quite a bit. Can be done in more areas than this.
No functional changes.
Simple Blender Render now no longer renders to an intermediate directory.
This not only simplifies the script, but it also opens the door for
selective re-running of individual tasks.
In the old situation, where the intermediate directory was renamed to
the desired name in the last task, rerunning tasks would fail because the
directory they expect to exist no longer exists. This is now resolved.
The original idea behind this job type was that it would work equally
well for videos as for images, but that was never really well tested.
It's currently broken, so this commit removes video support altogether.
Remove the `blender_cmd` setting, and just hard-code it to `{blender}`.
The Blender add-on was already passing this string, and it's very unlikely
that people are already writing custom add-ons to pass something different.
It provided flexibility that was untested, so it's better to simplify
things.
Implement the `getSharedStorage` operation in the Manager, and use it in
the add-on to get the shared storage location in a way that makes sense
for the platform of the user.
Manifest task: T100196
This variable is used in tests to mock the current OS, but wasn't set
during normal operation of the Manager. This caused issues with the
two-way variable system.
Split "executable" from "its arguments" in blender & ffmpeg commands.
Use `{blenderArgs}` variable to hold the default Blender arguments,
instead of having both the executable and its arguments in `{blender}`.
The reason for this is to support backslashes in the Blender executable
path. These were interpreted as escape characters by the shell lexer.
The shell lexer based splitting is now only performed on the default
arguments, with the result that `C:\Program Files\Blender
Foundation\3.3\blender.exe` is now a valid value for `{blender}`.
This does mean that this is backward incompatible change, and that it
requires setting up Flamenco Manager again, and that older jobs will not
be able to be rerun.
It is recommended to remove `flamenco-manager.yaml`, restart Flamenco
Manager, and reconfigure via the setup assistant.
There are Flamenco "commands" and CLI "commands", and it's nice to be
explicit about which is which. I'm sure this is needed in some other
areas as well.
Don't change backslashes to forward slashes on Windows. Trying to use
forward slashes everywhere was a mistake, and this is one of the steps to
make it right.
Backslashes can be included in two ways, as-is (which works fine) and
between double quotes (in which case they need escaping). This test checks
for both.
Add a new `abspath(path)` function to the add-on, for use in job type
settings. With this, the "simple blender render" job can support relative
paths for the "render output root" setting, and still have an absolute
final "render output path".
Two-way variable replacement now also changes the path separators. Since
the two-way replacement is made for paths, it makes sense to also clean up
the path for the target platform.
Instead of sending the current process an interrupt signal, use a dedicated
channel to signal the wish to shut down. The main function responds to that
channel closing by performing the shutdown.
This solves an issue where the Worker would not cleanly shut down on
Windows when `offline` state was requested by the Manager.
Workers can now be soft-deleted. Tasks assigned to the worker will remain
associated with that Worker. Active tasks will be re-queued so other
workers can pick them up.
For some reason, calling `AssocQueryStringW` on Windows Home returns error
code 122, "The data area passed to a system call is too small", even when
the data area is large enough. Furthermore, the API actually describes that
in such cases `S_FALSE` is supposed to be returned, with `*pcchOut` set to
the required size. Because of this apparent violation of the documentation,
and because it just works, Flamenco now ignores this particular error and
just returns the obtained string.
Convert the database interface from the stdlib `database/sql` package to
the GORM object relational mapper.
GORM is also used by the Manager, and thus with this change both Worker
and Manager have a uniform way of accessing their databases.
The Manager now broadcasts a worker update to SocketIO clients when a
worker gets a new task assigned. This ensures the "current task" shown in
the worker details view is up to date.
The etag prevents job submissions with old settings, when the job
compiler script has been edited. The etag is the SHA1 hash of the
`JOB_TYPE` dictionary (as defined by the JavaScript file). The hash is
computed in a way that's independent of the exact formatting in the
JavaScript file. Also the actual JS code itself is irrelevant, just the
`JOB_TYPE` dictionary is used.
Blender cannot be told to only allow absolute path for an RNA property
(of type `string`, subtype `dir_path`), so as a workaround the final
`render_output_path` is now using `bpy.path.abspath()` to make the path
absolute.
This has as advantage that the render output path can be defined by artists
as a blendfile-relative path, and that it'll be resolved when submitting
the blend file.
Remove `--factory-startup` from the default Blender arguments. This makes
it simpler to configure each Worker to use its own GPU, without having to
inject Python code into the arguments.
Users can always add this when they need, but I think it's friendlier to
have Blender behave the same when they manually run it and when used by
Flamenco Worker.
Include `RELEASE_CYCLE` in the Makefile. This is mentioned at startup of
Manager and Worker, and reflects in the software version they report.
If `RELEASE_CYCLE == "release"`, Manager and Worker report their version
as `ApplicationVersion`. If it's any other string, the Git hash will get
appended.
The Worker now always waits for subprocesses. When faced with multiple
errors (like I/O reading from stdout and a returned error status from
the process) will return the most important one (in this case the exit
status of the process).
Subprocesses need to be waited for, even when they crashed, otherwise
they will linger around as "defunct" processes. This caused
out-of-memory errors, because several defunct Blenders were eating up
the memory.
Blender and FFmpeg were run in the same way, using copy-pasted code. This
is now abstracted away into the CLI runner, which in turn is moved into
its own subpackage.
No functional changes.
When a worker's tasks get requeued because it goes to sleep, the task log
will now mention the worker identification (name + UUID). This aids in
figuring out what happened to tasks.
Blender Finder now understands that directory paths should be suffixed
with `blender` (Linux, macOS) or `blender.exe` (Windows).
Giving the Setup Assistant a path like `C:\Program files\Blender
Foundation\Blender 3.2` will now just work. This is considerably simpler
for many users, as copy-pasting a directory from a file explorer is
simpler than obtaining/typing the path to the executable.
When compiled without OpenColorIO, Blender will first complain "Color
management: Error could not find role data role." before showing the
actual version number. This is now handled by looking for a "Blender "
prefix instead of just returning the first line of output.
This has as a side-effect that when no such line can be found, we know
it's not Blender, and thus an error can be returned (instead of the
version of whatever binary was being run).
The Task details component already linked to the Worker it was assigned
to last, and now the Worker links back to the task.
There's only one task shown in the Worker details. If the Worker is
actively working on a task, that one's shown. Otherwise it's the
last-updated task that was assigned to the worker.
This commit does not introduce functional changes, besides renaming
every mention of 'wizard' with 'setup assistant'. In order to run the
manager setup assistant use:
./flamenco-manager -setup-assistant
The change was introduced to favor more neutral and descriptive working
for this functionality. Thanks to Sybren for helping to get this done!
Flamenco now no longer uses the Git tags + hash for the application
version, but an explicit `VERSION` variable in the `Makefile`.
After changing the `VERSION` variable in the `Makefile`, run
`make update-version`.
Not every part of Flamenco looks at this variable, though. Most
importantly: the Blender add-on needs special handling, because that
doesn't just take a version string but a tuple of integers. Running
`make update-version` updates the add-on's `bl_info` dict with the new
version. If the version has any `-blabla` suffix (like `3.0-beta0`) it
will also set the `warning` field to explain that it's not a stable
release.
Remove the `{ffmpeg}` variable from the default configuration, and its use
from the job compiler scripts. Now that the Worker can find its bundled
FFmpeg, it's no longer needed to configure its location on the Manager.
Worker will now try one of the following paths, relative to the flamenco-worker
executable, in order to find FFmpeg. If they cannot be found, `$PATH` is
searched for FFmpeg.
- `tools/ffmpeg-$GOOS-$GOARCH`
- `tools/ffmpeg-$GOOS`
- `tools/ffmpeg`
On Windows these paths will have a `.exe` suffix appended. `$GOOS` is the
operating system, like "linux", "darwin", "windows", etc. `$GOARCH` is the
architecture, like "amd64", "386", etc.
By accident I made the worker load `flamenco-worker.yaml` from the "local
files" directory (~/.local/share/flamenco on Linux) instead of the current
directory. This was incorrect, as that file is meant to contain
configuration that's shared between workers.
* Replace "OK!" with "successfully"
Remove exclamation mark since there is no need to call for attention.
Use "successfully" as it is more descriptive in this case than OK,
which can have other meanings.
- Added initial description and illustration
- Swap "Check" button for fields with a debounced @input event
- Turn Blender's list into a radio selector
- Tweak wording when paths are not found
- Add microtip library for tooltips
- Make navigation steps clickable, according to the state
Two-way variable implementation in the job submission end-point. Where
Flamenco v2 did the variable replacement in the add-on, this has now
been moved to the Manager itself. The only thing the add-on needs to
pass is its platform, so that the right values can be recognised.
This also implements two-way replacement when tasks are handed out, such
that the `{jobs}` value gets replaced to a value suitable for the
Worker's platform as well.
Add a line to the task log whenever task changes status. This only applies
to directly-changed tasks, and not to mass-updates (like all tasks going
from 'completed' to 'queued' on a job requeue).
When a Worker changes state from `awake` to something else, it cannot
run tasks any more. This now triggers a requeue of its active task
(should be one at most, if things are sane) so that another worker can pick
it up.
On Windows, store files in `%LOCALAPPDATA%\Blender Foundation\Flamenco`.
Previously the `Blender Foundation` part of the path was missing.
Manifest Task: T99415
Change the location where the Worker writes its local files so that it
follows the XDG specification (instead of writing to the current working
directory).
- Linux: `$HOME/.local/share/flamenco`
- Windows: `C:\Users\UserName\AppData\Local\Flamenco`
- macOS: `$HOME/Library/Application Support/Flamenco`
NOTE: The old files will not be loaded any more. This means that if
nothing is done and the new worker is run as-is, it will reregister as a
brand new worker. Move `flamenco-worker-credentials.yaml` and
`flamenco-worker.sqlite` to the new location to avoid this.
SQLite often errors out on this with only `interrupted (9)` as message.
This logging should at least tell us whether it's our own "background
context" timing out, or whether something else fishy is going on.
Fix workers timing out when they're `asleep`. When sleeping, the Worker
will call the `WorkerState` operation to see if they have to wake up, but
that didn't mark the workers as "seen". As a result, a sleeping worker
would always time out.
Run some API operations in a background context. This should prevent some
of the SQLite "interrupted" errors, as those can occur when the context
closes while a query is running.
The API operations that Workers use are now mostly running in a separate
background context, at least from the moment onward when they can run
independently of the Worker connection.
Improve the "my own URLs" construction, such that:
- IPv6 link-local addresses are always skipped. They require a "zone index"
string, typically the interface name, so something like
`[fe80::cafe:f00d%eth0]`. This is not supported by web browsers, so the
URLs would be of limited use. Furthermore, they require the interface
name of the side initiating the connection, whereas this code is used to
answer the question "how can this machine be reached as a server?"
- IPv4 addresses are sorted before IPv6 addresses. Even though I like IPv6
a lot, IPv4 is still more familiar to people.
- Loopback addresses (::1, 127.0.0.1) are sorted last, so that the First-
Time Wizard is most likely to use the bigger-scoped address.
Every time the web interface starts, it queries the config to see whether
it should be in first-time-wizard mode or not. This caused unnecessary
info-level logging.
In the future it would be better to load the config file just once,
instead.
For some reason, on Windows, creating a directory with zero permissions
still allows creating a file in there. Just skip that part of the test.
The Explorer's properties panel of the directory also shows "Read Only
(only applies to files)", so at least that seems consistent.
Be more selective in what's saved to the database to speed some things up.
Most importantly, this avoids saving the entire job when a task status is
updated or a task is assigned.
Add a SHA256 password hasher for worker authentication. It's not used at
the moment, but can be switched to for faster API queries. Note that
switching will cause authentication errors on already-existing workers,
which means they'll automatically re-register.
This is mostly useful for debugging & profiling purposes.
Move the Worker password hashing/comparison functions into a struct, and
use it via an interface. This will make it easier to switch to different
hashing algorithms.
Even with a low number of iterations, BCrypt is quite slow. That's good for
security, but not for Flamenco Worker authentication -- the password is
more as "nice check to avoid accidentally reusing the same ID" than
something for security.
Build with `make stresser`. Run with:
./stresser -worker UUID -secret ABCXYZ
The worker ID and secret can be obtained from
`flamenco-worker-credentials.yaml`. If left empty, the stresser will
register as a new worker, and log the credentials to be used on the next
invocation.
In the first-time wizard, if Blender cannot be found on $PATH but it can
be found via .blend file association, that should just be reported as a
normal sitation, and not as a `500 Internal Server Error`.
Trigger the first-time wizard on first-time runs of Flamenco, by defaulting
the storage path to the empty string.
The wizard can always be triggered with the `-wizard` CLI argument. This is
just for detection of first-time / unconfigured runs.
This just updates the config and saves it to `flamenco-manager.yaml`.
Saving the configuration doesn't restart the Manager yet, that's for
another commit.
This adds a `-wizard` CLI option to the Manager, which opens a webbrowser
and shows the First-Time Wizard to aid in configuration of Flamenco.
This is work in progress. The wizard is just one page, and doesn't save
anything yet to the configuration.
Manager always creates an implicit variable `{jobs}`. This used to be
Shaman-dependent, but now it's always there (has been for a while). This
is now reflected in an add-on comment, and in an extra unit test.
The task logs storage system is refactored to use the `local_storage`
package. Configuration options have also changed:
- `task_logs_path` is renamed to `local_manager_storage_path`, to
emphasise that only the Manager deals with those files, with default
value `./flamenco-manager-storage`.
- `storage_path` is renamed to `shared_storage_path`, to emphasise this
is the storage shared between Manager and Workers, with default value
`./flamenco-shared-storage`.
Task logs are still stored in
`${local_manager_storage_path}/job-{jobUUID[0:4]}/{jobUUID}/task-{taskUUID}.txt`
Manifest task: T99409
Various config sections were commented out, because they were brought in
from Flamenco 2 but weren't implemented yet. These have now been removed,
as the basic functionality is there, and new functionality will likely
be different from Flamenco 2 anyway.
Add a `-withBlender` CLI argument for a unit test, to aid in debugging
T99438.
Run the test with `go test ./internal/worker/find_blender/ -args -withBlender`
to actually fail when the file association with `.blend` files cannot be
found.
Note that this doesn't rely on Blender being runnable, but it does rely
on _something_ being associated with .blend files.
In the `simple-blender-render` job type settings, hide the `chunk_size`
setting from the web frontend, and show the `blendfile` setting instead.
The actual blend file being rendered is important to know, whereas the
chunk size can be inferred from the task names anyway.
Add a "Last Rendered" view to the webapp.
The Manager now stores (in the database) which job was the last
recipient of a rendered image, and serves that to the appropriate
OpenAPI endpoint.
A new SocketIO subscription + accompanying room makes it possible for
the web interface to receive all rendered images (if they survive the
queue, which discards images when it gets too full).
After processing an image in the "last-rendered" processor, a SocketIO
object is sent to clients to indicate the last-rendered image needs to
be (re)loaded.
This also moves the previously existing "done callback" from a single
function to a per-image callback, so that it can be called with the
right information in there, and only when that particular image is
actually done processing.
The notification message sent via SocketIO also contains the necessary
info to render the image, so that the web client doesn't have to call
the `fetchJobLastRenderedInfo` operation.
`os.IsNotExist()` is from before `errors.Is()` existed. The latter is the
recommended approach, as it also recognised wrapped errors.
No functional changes, except for recognising more cases of "does not
exist" errors as such.
Workers now send output produced by Blender (limited to PNG and JPEG
images, currently) to Manager. This is done by converting to JPEG first,
then sending the bytes via the Flamenco API to the Manager.
Prepare the Worker for submission of last-rendered images to Manager, by
parsing `stdout` of Blender to see which files were saved.
This needs more work, as now just an error "not implemented" is logged.
The test can hang occasionally, and needs some love & attention. For now
I've done some patching to make it slightly better, but still disabled it
and added a `FIXME` note to it.
On Windows it's not allowed to erase a file while it's opened, which caused
this error to surface. The file is now properly closed before the test
file is erased.
Add a handler for the OpenAPI `taskOutputProduced` operation, and an
image thumbnailing goroutine.
The queue of images to process + the function to handle queued images
is managed by `last_rendered.LastRenderedProcessor`. This queue currently
simply allows 3 requests; this should be improved such that it keeps
track of the job IDs as well, as with the current approach a spammy job
can starve the updates from a more calm job.
Add a `local_storage` package that finds a suitable place to put files.
Currently it just looks at the location of the currently running
executable; it can later do other things. It can be queried for directory
to put job-specific files.
It is intended to be used by the under-development "last rendered output"
processing system, to store an image file per job. Later we should also
refactor the task log handling system to use this.
If there is a `scripts` directory next to the current executable, load
scripts from that directory as well.
It is still required to restart the Manager in order to pick up changes
to those scripts (including new/removed files), PLUS a refresh in the
add-on.
When on-disk job compiler scripts are supported, they will be reloaded
often, and it becomes more important to have the access to the map of
loaded job compilers thread-safe.
The global `scriptFS` variable was too easy to access, which caused an
issue where the mandatory `"scripts"` subdirectory was not passed.
Accessing via a getter function that hides this requirement prevents this.
Take some functions out of the `Service` struct, as they are more or less
standalone anyway. This will also make it easier later to make things
thread-safe, as that'll become important when files can get live-reloaded.
The `Load()` function returns a `*Service`, and it was confusing that the
local variable is named `compiler` instead. Now it's called `service`.
No functional changes.
Refactor the JS script file loading code so that it's tied to the `fs.FS`
interface for longer, and less to the specifics of our `embed.FS` instance.
This should make it possible to use other filesystems, like a real on-disk
one, to load scripts.
Allow overriding the worker name by setting the `FLAMENCO_WORKER_NAME`
environment variable. This makes it easy to do from Docker configs, and,
more importantly, from the scripts I use to run multiple workers on the
same machine while developing Flamenco.
Move the code related to task updates from workers to
`worker_task_updates.go`. It's going to get more complex with the
blocklisting in there; this prepares for that.
No functional changes.
When a job or task gets requeued from the web interface, its task
failure lists (i.e. the list of workers that previously failed this
task) will be cleared.
This clearing doesn't happen in other situations, e.g. when a worker
signs off and its task gets requeued, the task's failure list will
remain as-is.
Rename functions `onTaskStatusX` to `updateJobOnTaskStatusX` to clarify
their responsibility is to update the job in reaction to a task status
change.
No functional changes.
Introduce `ParameterMissingError` and `ParameterInvalidError` structs, to
be returned from command executors. These replace free-form `fmt.Errorf()`
style errors.
Fix sqlite issues in the "upstream buffer" test. The test used
`:memory:` to have an in-memory DB to separate from other tests. The
"flush at shutdown" code runs in a different goroutine, though, and
creates a new DB connection. The SQLite separation was too strong,
making that function not find any tables. This is now solved by having
an in-memory database that's shared between all connections made from
the same unit test.
Within the shutdown procedure, signing off is now the last thing the
worker does. This makes things more consistent from the Manager's point
of view (like receiving last-second log entries while the Worker is still
online).
This fixes a bug where 'Worker undefined changed status' was logged in
the web interface, as that was (back then incorrectly) `workerupdate.name`.
Now that code is correct.
Before checking whether the Worker is allowed to do work (i.e. is in
`awake` state), check any queued-up status changes. Those should be
communicated, before saying "no work for you", so that the Worker can
actually respond to it.
When a Worker indicates a task failed, mark it as `soft-failed` until
enough workers have tried & failed at the same task.
This is the first step in a blocklisting system, where tasks of an
often-failing worker will be requeued to be retried by others.
NOTE: currently the failure list of a task is NOT reset whenever it is
requeued! This will be implemented in a future commit, and is tracked in
`FEATURES.md`.
The persistence layer can now store which worker failed which task, as
preparation for a blocklisting system. Such a system should be able to
determine whether there are still any workers left to do the work.
This is needed for a future unit test, and exposed the fact that SQLite
didn't enforce foreign key constraints (and thus also didn't handle
on-delete-cascade attributes). This has been fixed in the previous commit.
When creating tasks the inter-task dependencies are saved as a 2nd pass,by
updating the tasks in the database. This now only saves those dependencies,
and no longer saves the entire task again.
`persistence.Model` contains the common database fields for most model
structs. It is a copy of `gorm.Model`, but without the `DeletedAt`
field (which triggers Gorm's soft deletion).
Soft deletion is not used by Flamenco. If it ever becomes necessary to
support soft-deletion, see https://gorm.io/docs/delete.html#Soft-Delete
When receiving a `TaskUpdate` from a Worker, write to the task log, before
handling any task status change.
If both log and task status change are sent, the log will likely contain
the cause of the task state change. Any subsequent task logs, for example
generated by the Manager in response to the status change, should be
logged after that.
The canary test asserts that certain constants still have the expected
value. Lowering those constants is good for testing the timeout stuff with
the actual Flamenco Manager + Worker (without having to wait 5 minutes for
it to kick in), but it's too easy to accidentally run the unit tests and
get cryptic errors about everything failing horribly and miserably when
you leave those constants low.
Update the 'last seen at' timestamp of workers when they:
- sign on
- sign off
- get a task assigned
- send a task update
- check whether they can keep running their task
Note that this commit is necessary to not have the workers time out
immediately ;-)
Requeueing the tasks of a specific worker is now done in the
`TaskStateMachine`, such that it can be called from other services as
well in future commits.
This also makes the `LogStorage` service a dependency of the
`TaskStateMachine`, as it needs to write "this task was requeued" kind
of messages to the task logs.
SQLite doesn't handle timezones by default, when you just use something
like `date1 < date2`, for example. This makes GORM explicitly use UTC
timestamps for the `CreatedAt`, `UpdatedAt`, and `DeletedAt` fields.
Our own code should also use UTC when saving timestamps. That way all
datetimes in the database are in the same timezone, and can be compared
naievely.
The `TestTaskTimeout()` unit test assumes specific durations for initial &
subsequent sleeps of the timeout checker. The test will fail quite
cryptically when that assumption doesn't hold, so just test for it at
the start of the unit test.
Tasks that are in state `active` but haven't been 'touched' by a Worker
for 10 minutes or longer will transition to state `failed`.
In the future, it might be better to move the decision about which state
is suitable to the Task State Machine service, so that it can be smarter
and take the history of the task into account. Going to `soft-failed`
first might be a nice touch.
In the future different services will write to the task log, and thus
it makes sense to move the responsibility of prepending the timestamps
to the log storage service.
The requeue-task-on-worker-signoff operation also needs to log a timestamp.
The code for this, and the recently added code for timestamping the
"task assigned to worker" message, are now unified.
A worker state change request can now be cancelled by requesting the worker
to go to its current state. In other words, a previously requested change
`A → B` can be cancelled by requesting the worker goes to state `A`.
Previously this would simply overwrite the last request, resulting in a
requested state change `A → A`. Having this non-lazy would even interrupt
the currently running task.
rFcfb17b178da2055ef12b2aa2ad8f7f778a952bc3 changed the semantics of
`SocketIOWorkerUpdate`, in the sense that any update that doesn't change
the worker status can omit `previous_status`. This commit adjusts the
unit test for this.