systemslab-server
Configuration Reference
The server configuration file is always found at /etc/systemslab/server.toml.
Options
postgres
This is the URL that the server will use to connect to the database. The format of this URL is the same as that accepted by libpq. See the documentation here.
If not provided, this defaults to connecting to a database on the same machine on the default postgres port (5432).
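For example, a server.toml pointing at a remote Postgres instance might contain a line like the following sketch; the host, database name, and credentials are placeholders:

```toml
# libpq-style connection URL; host, user, password, and database name are placeholders.
postgres = "postgres://systemslab:secret@db.example.internal:5432/systemslab"
```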
controller
Whether this SystemsLab server is a controller.
The SystemsLab controller is responsible for scheduling experiments onto agents. There must be exactly one controller in a cluster; if there is none, no experiments will be scheduled.
port
The port that the server will listen on. If not set then this is 3000.
max_experiment_jobs
The maximum number of jobs that is permitted in an experiment. Attempting to submit an experiment with more than this number of jobs will result in an error. By default, this is 32.
max_experiment_runs
The maximum number of times that an experiment can be retried if it fails due to an infrastructure error. By default, this is 8.
reap_interval
The interval, in seconds, at which the controller will check to see if any agents have disappeared without notifying the server.
By default, this is performed every 15 minutes.
realtime_metrics
Enable or disable storage of realtime metrics in the database. Disabling this will mean that certain metrics views in the frontend will fail to work.
Realtime metrics are enabled by default.
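Putting the top-level options together, a minimal server.toml might start out like the sketch below; the values shown are the documented defaults, with reap_interval expressed in seconds:

```toml
# Make this server the (single) controller and listen on the default port.
controller = true
port = 3000

# Limits on experiment size and infrastructure retries.
max_experiment_jobs = 32
max_experiment_runs = 8

# Look for vanished agents every 15 minutes (value in seconds).
reap_interval = 900

# Keep storing realtime metrics so the frontend metric views keep working.
realtime_metrics = true
```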
[log]
This section controls logging and log messages emitted by the server.
level
A log filter that controls which log statements are enabled within the server.
The filter expression can be either a single level (e.g. info or debug) or a more complicated expression that specifies locations and their levels. The full syntax is documented in the env_filter crate docs.
If not specified then the RUST_LOG environment variable will be used to determine the log level.
The actual modules that can be filtered on are an internal implementation detail of systemslab, but some useful ones are:
- log = "systemslab=debug" - See more logging messages from within systemslab itself.
- log = "info,sqlx=debug" - See normal logging output but also show all SQL queries that are executed by the server.
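As a sketch, a [log] section using the second filter above would look like this:

```toml
[log]
# Normal output, plus every SQL query the server executes.
level = "info,sqlx=debug"
```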
access_log
The path to a directory in which to store the access log.
If not specified then access logs are emitted as part of the regular service log.
access_log_size
The maximum size, in bytes, before the access log file will be rotated.
access_log_count
The maximum number of old compressed access log files to keep around before they are deleted.
cluster_id
A cluster id that is used when reporting error telemetry.
This does not affect any of the functionality of systemslab but allows error telemetry to be traced back to the customer cluster that emitted it.
This should be named something like <company name>/<cluster name>. For example, a cluster within IOP systems might be named iop/cluster-1.
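A fuller [log] section with a rotated access log might look like the sketch below; the directory, rotation size, file count, and cluster name are illustrative, and cluster_id is shown under [log] only because of the ordering of options in this reference:

```toml
[log]
level = "info"

# Write access logs to their own directory, rotating at ~50 MiB and
# keeping the 10 most recent compressed files (illustrative values).
access_log = "/var/log/systemslab/access"
access_log_size = 52428800
access_log_count = 10

# Identify this cluster in error telemetry (placement assumed from the
# ordering of options in this reference).
cluster_id = "iop/cluster-1"
```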
[database]
0.0.103
This section contains config options for the database connection pool within the server.
max_connections
0.0.103
The maximum number of database connections that can be maintained by the connection pool.
Note that postgres itself has a limit on how many connections are available so setting this to too large of a value will exhaust all database connections. The default connection limit for postgres is 100 connections (with 2 of those reserved for superusers only).
The default connection limit is 50. Packages for systemslab set this to 80.
min_connections
0.0.103
The minimum number of connections to be maintained by the connection pool at all times. This does not guarantee that these connections will be idle, just that they will remain active.
Note that due to limitations in the underlying SQL library (sqlx
), this setting may not work as intended.
The default value for this option is 0.
acquire_timeout
0.0.103
The maximum amount of time to spend waiting for a database connection to become available before emitting an error. The value is a timeout specified in seconds.
WARNING
Setting this to a low value, such that methods within the server are unable to acquire connections, is likely to surface bugs within the server.
The default timeout is 30s.
acquire_slow_threshold
0.0.103
The threshold for warning about slow acquire times. The value is a timeout specified in seconds.
If acquiring a connection takes longer than this threshold then the server will emit a warning.
The default threshold is 2s.
max_lifetime
0.0.103
The maximum permitted lifetime of an individual connection. The value is the duration, in seconds.
Long lived connections may result in memory/resource leaks within the database. By periodically closing connections the database has the opportunity to clean up data structures associated with a session.
The default is 30 minutes.
idle_timeout
0.0.103
The maximum duration that a connection may remain idle before it is closed. The value is the duration, in seconds.
The default is 10 minutes.
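Taken together, a [database] section using the defaults described above (with the packaged max_connections of 80) might look like this sketch; all durations are in seconds:

```toml
[database]
# Stay well under postgres' own connection limit (100 by default).
max_connections = 80
min_connections = 0

acquire_timeout = 30
acquire_slow_threshold = 2

# Recycle connections periodically so postgres can clean up session state.
max_lifetime = 1800   # 30 minutes
idle_timeout = 600    # 10 minutes
```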
[durable]
0.0.103
This section contains config options for the durable runtime that is run by the server. This is responsible for running control plane tasks that control how and where experiments are run, among other things.
enabled
0.0.103
Whether this server will run a durable worker.
At least one server in the cluster must be running a durable worker. If none are then no experiments will make progress or be scheduled.
This is true by default.
workflow_dir
0.0.103
A path to a directory containing the WASM workflow binaries for the control tasks run by the server.
Using binaries other than those shipped with the server is likely to result in unexpected errors. This option is mainly provided for installs that have placed files in a non-standard location.
The default is /usr/lib/systemslab-server/workflows.
heartbeat_interval
0.0.103
The period at which workers will update their heartbeat timestamp in the database. This is used to determine if the worker is still alive.
The default is 30s.
heartbeat_timeout
0.0.103
The amount of time a worker can go without updating its heartbeat before it is considered to have disappeared. It is recommended to set this to at least 2x the heartbeat interval.
Note that this also controls how long tasks will be stranded if a worker dies unexpectedly, since they cannot be rescheduled until the worker is marked as lost. This is not an issue if the worker exited normally.
The default value is 120 seconds.
wasm_entry_ttl
0.0.103
The duration that a WASM binary will be kept around after it has last been used. After this period the server will remove the WASM binary.
The default duration is 86400 seconds (24 hours).
max_http_timeout
0.0.103
The maximum permitted timeout when a durable workflow is making HTTP requests.
The default value is 60s.
max_workflow_events
0.0.103
The maximum permitted number of events that can be emitted by a durable workflow before it will be forcibly terminated.
This is meant as a safety measure against buggy workflows that would use too many resources.
WARNING
Setting this to a value so low that normal execution of experiments hits it is likely to result in experiments appearing to be in "stuck" states. This can manifest as hosts being assigned to an experiment with no experiment ever actually submitted to those hosts, jobs being stuck in the pending state forever, and more.
The default value is 100000 events.
suspend_timeout
0.0.103
The duration that a task will wait on a timer or notification before it gets suspended.
Setting this to a low value means that more tasks can be executed in the same amount of time (for a given max_tasks setting) but that more tasks will need to be resumed from the database.
The default value for SystemsLab is 1 second.
suspend_margin
0.0.103
How far ahead of the completion of the timer it is blocked on a task will be woken up.
This gives the task time to be replayed up to the point where it previously suspended, so that it is ready when the timer completes.
The default value is 2 seconds.
max_tasks
0.0.103
The maximum number of tasks that are permitted to be actively executing on this worker at once.
Each task takes up resources (memory, CPU, active database connections). Reaching one of these limits will result in the task failing to execute. Generally, this will result in the task being restarted somewhere else, but it may also result in the task being marked as failed or the whole worker being restarted.
Note that this only limits the number of tasks that are actively executing on this worker. Completed tasks, suspended tasks, and queued tasks do not count.
The default value is 100.
max_concurrent_compilations
0.0.103
The maximum number of WASM binaries that can be compiled concurrently.
Compiling WASM down to machine code is rather expensive (e.g. a decently sized module can take 300ms to compile using all cores) so if a worker has to build a bunch at once it can use up all the available cores and memory on a machine.
Note that the resulting compiled code is cached once compiled so multiple tasks using the same WASM binary will not result in more compilations.
The default limit is 4.
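For reference, a [durable] section spelling out the documented defaults might look like the sketch below (durations in seconds):

```toml
[durable]
enabled = true
workflow_dir = "/usr/lib/systemslab-server/workflows"

heartbeat_interval = 30
heartbeat_timeout = 120          # at least 2x the heartbeat interval
wasm_entry_ttl = 86400           # 24 hours
max_http_timeout = 60

max_workflow_events = 100000
suspend_timeout = 1
suspend_margin = 2
max_tasks = 100
max_concurrent_compilations = 4
```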
[storage]
This section controls where and how systemslab stores artifacts.
bucket
The location at which to store artifacts. This can be a location on the local filesystem or a URL for a storage bucket.
- To store artifacts on the local filesystem you can specify a path explicitly: /var/lib/systemslab/data.
- To store artifacts in S3, specify an S3 URL: s3://my-bucket-name. By default, this will use AWS S3, but you can configure the endpoint for a different service by setting the options under [storage.s3].
- To store artifacts in GCS, specify a GCS URL: gs://my-bucket-name.
If not otherwise specified, systemslab-server will store artifacts at /var/lib/systemslab/data.
compression
Whether artifacts should be compressed before storing them. Many artifacts are quite compressible so enabling compression can significantly reduce the amount of storage needed to store them.
By default, this is true.
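As an illustration, a [storage] section that keeps compression enabled and stores artifacts in an S3 bucket (the bucket name is a placeholder) could look like:

```toml
[storage]
bucket = "s3://my-bucket-name"
compression = true
```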
[storage.s3]
This section contains configuration options specific to storing artifacts in S3. These can be used to specify authentication or to store artifacts in an S3-compatible service other than AWS S3.
If you do not use S3 to store artifacts then none of the options in this section will have any effect.
endpoint
The API endpoint to use when talking to S3.
If not specified then this defaults to https://s3.amazonaws.com.
region
The signing region of this endpoint.
If not specified then the region will be detected based on the bucket name.
default_storage_class
The default storage class used for newly created artifacts.
Available values are:
DEEP_ARCHIVE
GLACIER
GLACIER_IR
INTELLIGENT_TIERING
ONEZONE_IA
OUTPOSTS
REDUCED_REDUNDANCY
STANDARD
STANDARD_IA
If not specified then this will use the default specified by the bucket configuration.
access_key_id
The access key used to authenticate with S3. If you specify this then you will likely also need to specify secret_access_key as well.
If not specified then the default AWS credential chain will be used instead.
secret_access_key
The access key secret used to authenticate with S3. If you specify this then you will likely also need to specify access_key_id as well.
If not specified then the default AWS credential chain will be used instead.
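A sketch of a [storage.s3] section pointing at an S3-compatible service with static credentials; the endpoint, region, storage class, and key values are all placeholders:

```toml
[storage.s3]
endpoint = "https://s3.example.internal"
region = "us-east-1"
default_storage_class = "STANDARD_IA"

# Static credentials; omit these to use the default AWS credential chain instead.
access_key_id = "EXAMPLEACCESSKEYID"
secret_access_key = "example-secret-access-key"
```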
[storage.gcs]
This section contains configuration options specific to storing artifacts in GCS. These can be used to specify authentication or to store artifacts in a different GCS-compatible service.
endpoint
The API endpoint to use. This defaults to the GCS endpoint.
service_account
The service account to use when authenticating to GCS.
If not specified then this will be determined from the current ambient GCS credentials.
credentials_json
The path to a service account JSON file to use when authenticating with GCS.
If not specified then the ambient GCS credentials will be used.
default_storage_class
The default storage class used for uploaded artifacts.
Available values are:
STANDARD
NEARLINE
COLDLINE
ARCHIVE
If not specified then the storage class used is decided by the bucket's configuration.
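And a sketch of a [storage.gcs] section authenticating with an explicit service account key file; the account name, key path, and storage class are placeholders:

```toml
[storage.gcs]
service_account = "systemslab-artifacts@my-project.iam.gserviceaccount.com"
credentials_json = "/etc/systemslab/gcs-credentials.json"
default_storage_class = "NEARLINE"
```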