system-manager, init — manage the system as process #1
system-manager [args ...]
init [args ...]
system-manager is meant to be invoked as process #1, either as the first user process of an entire system, or as the first process of a "container" running within a Linux PID namespace or a BSD jail. It will not operate correctly if it is not process #1. To manage per-user (non-system-wide) things, use per-user-manager(8). It should also not be confused with service-manager(1).
Its design is intended to keep process #1 simple, since the operating system regards it as a vital system process. In particular:
system-manager doesn't contain (or link to) library code for complex parsing and communications functionality, such as XML parsers and libraries for D-Bus, PAM, and udev. No parsing or RPC marshalling is done by process #1. It is also not involved in any Plug-and-Play device management or Desktop bus systems.
Process #1 is the system manager, as distinguished from the service manager which is another process. Process #1 neither contains nor manages service state tables. It does not have open file handles to the service control FIFOs, and its operation is not complicated by mixing the system state with individual service states.
Process #1 has no hand in calculating the details of system state changes. That's done by a separate program running as another process.
The operation of system-manager falls into four parts: process setup, system setup, reaping, and responding to system events.
system-manager expects to be started in the normal state for process #1 (of the system or of a container/jail). It does very little to its process state, which is inherited by the service manager and the logger:
It sets itself as a session leader, as if by setsid(1). If, as is the case on FreeBSD, the session already has a controlling TTY device, the association from the session to that device is removed.
(On operating systems that support this) It calls setlogin(2) to set the session's login name to root
.
It changes current directory to /
as if by chdir(1), on the grounds that on some systems there is an "initrd" mechanism that might have left the current directory somewhere else.
It resets the file/directory creation mask to 0000 as if by umask(1), on the same grounds.
It sets the hardwired default environment:
PATH
=/usr/local/bin
:/usr/local/sbin
:/usr/bin
:/usr/sbin
:/bin
:/sbin
LANG
=C.UTF-8
(Linux operating systems, per the GNU C library project and consequent initiatives in Gentoo, Fedora, Debian, and others)
LANG
=C
(others)
It reads the administrator-configurable default environment.
If the directory /etc/locale.d
exists, it processes it as if by envdir(1).
Otherwise it processes, as if by read-conf(1), the first file that is found (and can be opened for reading) in the list:
/etc/locale.conf
/etc/default/locale
/etc/sysconfig/i18n
/etc/sysconfig/language
/etc/sysconf/i18n
As the names indicate, this default environment is only expected to comprise locale-controlling variables such as LANG
.
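The environment steps above can be sketched as follows. This is a minimal illustration, not the program's actual code: the function names and the `root` parameter (which lets the lookup run against a scratch directory) are hypothetical, and "can be opened for reading" is modelled simply by attempting to open each file.

```python
import os

# Candidate files, in the order given in the list above.
LOCALE_FILES = [
    "/etc/locale.conf",
    "/etc/default/locale",
    "/etc/sysconfig/i18n",
    "/etc/sysconfig/language",
    "/etc/sysconf/i18n",
]


def hardwired_environment(is_linux):
    """Return the hardwired default environment described above."""
    return {
        "PATH": "/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin",
        "LANG": "C.UTF-8" if is_linux else "C",
    }


def locale_source(root="/"):
    """Pick the source of the administrator-configurable defaults.

    Returns ("envdir", path) when <root>/etc/locale.d is a directory,
    otherwise ("file", path) for the first candidate that can be opened
    for reading, otherwise None.
    """
    def under(path):
        return os.path.join(root, path.lstrip("/"))

    locale_d = under("/etc/locale.d")
    if os.path.isdir(locale_d):
        return ("envdir", locale_d)
    for candidate in LOCALE_FILES:
        path = under(candidate)
        try:
            with open(path, "r"):
                return ("file", path)
        except OSError:
            continue
    return None
```

The directory takes precedence over every file in the list, so a system that has both /etc/locale.d and /etc/locale.conf is configured from the directory alone.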
system-manager performs various setup actions so that the full kernel "API" is visible to itself and its descendants:
It mounts the "API" filesystems in their accustomed places.
It creates the device nodes for various "early" devices that are required to exist before any plug-and-play device management services start up.
If control groups are available and it is in one, it enables the CPU, memory, IO, and tasks control group controllers for its own control group and for the service-manager.slice
control group immediately below it.
It moves itself into a me.slice
control group, so that the controllers can be enabled for sub-groups.
It instructs the kernel to send the signals for various optional system events such as secure-attention-key
and kbrequest
.
It corrects the system clock.
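The control group steps can be illustrated with the version 2 control group interface files `cgroup.procs` and `cgroup.subtree_control`. This is a sketch under assumptions: the helper names are hypothetical, the mapping of the "tasks" controller to the version 2 name `pids` is an assumption, and the directory is passed in so that the code can be exercised against a scratch directory rather than a real control group.

```python
import os

# Assumed version 2 names for the controllers that the text lists
# (CPU, memory, IO, tasks); "pids" standing in for "tasks" is an assumption.
CONTROLLERS = ["cpu", "memory", "io", "pids"]


def move_into_leaf(cgroup_dir, leaf="me.slice", pid=None):
    """Create the leaf group and move a process (default: this one) into it."""
    leaf_dir = os.path.join(cgroup_dir, leaf)
    os.makedirs(leaf_dir, exist_ok=True)
    with open(os.path.join(leaf_dir, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid() if pid is None else pid))
    return leaf_dir


def enable_controllers(cgroup_dir, controllers=CONTROLLERS):
    """Request the controllers for the sub-groups of cgroup_dir."""
    request = " ".join("+" + name for name in controllers)
    with open(os.path.join(cgroup_dir, "cgroup.subtree_control"), "w") as f:
        f.write(request)
    return request
```

On a real version 2 hierarchy a non-root group cannot hand controllers to its sub-groups while it still contains processes of its own, which is why the move into the leaf group has to come first.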
system-manager operates as a "grim reaper", cleaning up after any child processes that exit. The operating system re-parents a few orphaned processes (mainly ones started directly by the kernel) to it. system-manager spawns exactly three kinds of process itself:
After creating a local domain socket at /run/service-manager/control
, it spawns an instance of service-manager(1).
If control groups are available, it is run in its own dedicated service-manager.slice
subordinate control group below the system-manager's original own.
This is the global service manager for the system, controlled through the socket.
It is not expected to ever terminate (before shutdown).
If it does, system-manager re-spawns it.
Most orphaned processes in the system are re-parented to this sub-process, or further subordinate per-user service manager processes, and not to system-manager.
As system events occur, it spawns (ephemeral) instances of system-control(1).
If control groups are available, they are run in their own dedicated system-control.slice
subordinate control group below the system-manager's original own.
These calculate the details of service and target dependencies for system state changes, and pass instructions to the global service manager for bringing services up and down.
Only one instance is spawned at a time.
It spawns an instance of cyclog(1) with its input connected to the read end of a pipe.
If control groups are available, it is run in its own dedicated system-manager-log.slice
subordinate control group below the system-manager's original own.
This process is expected to only terminate when the pipe is closed, or the manager explicitly terminates it.
If it terminates and management is not stopping, system-manager simply re-spawns it.
The write end of the aforementioned pipe is connected to the standard outputs and standard errors of the service manager, the (ephemeral) service controllers, and of system-manager itself.
(Their standard input is connected to /dev/null
.)
system-manager retains open file descriptors to this pipe, so that no unsaved log data are lost should the logger unexpectedly exit.
The logger is intended to be just for the system manager, the service manager, and the service controllers.
Actual services should be plumbed to their own logging services.
The logger is told to write its logfiles to /var/log/system-manager
, or failing that /run/system-manager/log
(which will by default be in a tmpfs filesystem), and to cap their maximum total size at 1MiB.
/var/log/system-manager
normally will not be mounted when the logger first starts up, and so will not be used until the system manager is told to restart logging.
Be aware that this makes it necessary to restart logging again at shutdown in order to then unmount that volume.
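The choice of log directory can be sketched as a simple fallback. The function name, the `force_volatile` flag, and the test used for "usable" (an existing, writable, searchable directory) are assumptions made for illustration; only the two paths and the 1MiB cap come from the text.

```python
import os

PERSISTENT_LOG_DIR = "/var/log/system-manager"
VOLATILE_LOG_DIR = "/run/system-manager/log"
MAX_TOTAL_SIZE = 1024 * 1024  # the 1MiB cap on total logfile size


def choose_log_dir(force_volatile=False,
                   candidates=(PERSISTENT_LOG_DIR, VOLATILE_LOG_DIR)):
    """Return the directory that the logger should be told to write to.

    With force_volatile set, the last candidate (the tmpfs-backed one) is
    used unconditionally; otherwise the first usable candidate wins.
    """
    if force_volatile:
        return candidates[-1]
    for directory in candidates:
        if os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK):
            return directory
    return None
```

Early in the bootstrap the persistent directory does not yet exist, so the fallback is chosen; once the volume is mounted, a logging restart picks the persistent directory instead.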
The only IPC mechanism provided by system-manager is signals. (Commands to manipulate services are sent to the spawned service manager, not to the system manager.) System-wide events are flagged, by the kernel and by other programs, by sending various signals to process #1. system-manager responds to these signals as follows:
SIGRTMIN + 3
,
SIGRTMIN + 4
,
SIGRTMIN + 5
, and
SIGRTMIN + 7
(and, for compatibility, respectively
SIGUSR1
,
SIGUSR2
,
SIGINT
, and
SIGWINCH
on BSD)
Spawn (respectively)
system-control start halt ,
system-control start poweroff ,
system-control start reboot , or
system-control start powercycle .
This will activate the
halt
,
poweroff
,
reboot
, or
powercycle
target.
Activating these targets activates the shutdown
target.
Other targets do not imply shutdown.
shutdown
is configured to conflict with login services and all normal server and workstation services, and will hence cause them to be stopped.
(This is written into the packaged target definitions, not hardwired into system-control(1).)
SIGRTMIN + 2
Spawn system-control start emergency .
This will activate the emergency
target.
SIGRTMIN + 1
Spawn system-control start rescue .
This will activate the rescue
target.
SIGRTMIN + 0
Spawn system-control start normal .
This will activate the normal
target.
SIGPWR
Spawn system-control activate powerfail .
This will activate the powerfail
target, which is expected to
take action to deal with impending power failure.
SIGWINCH
(on Linux)
Spawn system-control activate kbrequest .
This will activate the kbrequest
target.
SIGINT
(on Linux)
Spawn system-control activate secure-attention-key .
This will activate the secure-attention-key
target.
SIGRTMIN + 13
, SIGRTMIN + 14
, SIGRTMIN + 15
, SIGRTMIN + 17
Close the pipe, terminate the service manager, and wait a short while for it. If the system manager is the system-wide process #1, tell the kernel to flush its disc cache and (respectively) halt, power off, reboot, or power cycle the system. Otherwise, if the system manager is running in a container/jail, just exit.
When the reboot
, halt
, powercycle
, and poweroff
targets are fully active, they are expected to send the SIGRTMIN + 15
, SIGRTMIN + 13
, SIGRTMIN + 17
, and SIGRTMIN + 14
signals (respectively) to process #1.
In the packaged target definitions, they use the --force option to the reboot, halt, poweroff, and powercycle subcommands of system-control(1) to do this.
SIGRTMIN + 10
Spawn system-control activate sysinit .
This will activate the sysinit
target.
SIGRTMIN + 26
, SIGRTMIN + 27
, SIGRTMIN + 28
Terminate the logger process with SIGTERM
.
Normally it will be automatically restarted after it terminates.
This is used to make the logger start up in a different directory, after (for example) /var/log/system-manager
has been mounted, or before it is about to be unmounted.
SIGRTMIN + 26
forces the use of the /run/system-manager/log
directory.
SIGRTMIN + 27
and SIGRTMIN + 28
allow the use of (the first successfully accessible of) all potential log directories.
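The real-time signal responses listed above amount to a lookup table. The sketch below restates that table with offsets relative to SIGRTMIN; the dictionary and function names are hypothetical, and the non-real-time signals (SIGPWR, SIGWINCH, SIGINT, and the BSD compatibility signals) are left out for brevity.

```python
# Offsets from SIGRTMIN that cause a system-control instance to be spawned,
# mapped to the arguments given in the list above.
SPAWN_ON_SIGNAL = {
    0: ["start", "normal"],
    1: ["start", "rescue"],
    2: ["start", "emergency"],
    3: ["start", "halt"],
    4: ["start", "poweroff"],
    5: ["start", "reboot"],
    7: ["start", "powercycle"],
    10: ["activate", "sysinit"],
}

# Offsets that end system management and name the final kernel action.
FINAL_ACTION = {13: "halt", 14: "poweroff", 15: "reboot", 17: "powercycle"}

# Offsets that terminate the logger so that it restarts elsewhere.
LOGGER_RESTART = {26: "volatile-only", 27: "any", 28: "any"}


def respond(offset):
    """Describe the response to SIGRTMIN + offset as an (action, detail) pair."""
    if offset in SPAWN_ON_SIGNAL:
        return ("spawn", ["system-control"] + SPAWN_ON_SIGNAL[offset])
    if offset in FINAL_ACTION:
        return ("shutdown", FINAL_ACTION[offset])
    if offset in LOGGER_RESTART:
        return ("restart-logger", LOGGER_RESTART[offset])
    return ("ignore", None)
```

Note the two-stage shape of a shutdown: offset 5 only spawns `system-control start reboot`, and it is the fully active reboot target that later sends offset 15 to perform the actual reboot.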
What the kbrequest
and secure-attention-key
targets do is configured by the system administrator.
For traditional Linux and BSD semantics, secure-attention-key
should run the reboot(8) command (or some wrapper around it) and kbrequest
should run the rescue(8) or emergency(8) command.
For semantics more akin to those of Microsoft Windows NT, secure-attention-key
should run login(1) on a (secure) console, or the GUI equivalent on a secure desktop; and kbrequest
should run vlock(1) or the GUI equivalent, similarly.
system-manager startup is also treated as a system event.
In response to this "event" system-manager spawns system-control init , passing it the [args
...] that were supplied on its own command line.
(For process #1 of the entire system, these options are supplied to the initial program by the boot loader via the kernel.
In a container/jail, they are supplied by the container/jail configuration.)
This calculates what to initialize, deduced from those arguments, and sends appropriate signals back to the system manager process.
"API" filesystems are filesystems that do not employ persistent backing storage, and that provide means for interrogating and configuring kernel mechanisms. They are thus effectively extensions to the kernel's system call API.
/proc
A proc
filesystem is mounted here with options nodev
and nosuid
.
(noexec
is not used because that would disallow the trick of using /proc/N/exe
to re-execute a process' executable.)
/sys
The sysfs
filesystem is mounted here with options nodev
, noexec
, and nosuid
.
/run
A tmpfs
filesystem is mounted here with options nodev
, nosuid
, strictatime
, size=20%
, and mode=0755
.
/run/shm
A tmpfs
filesystem is mounted here with options nodev
, noexec
, nosuid
, strictatime
, size=50%
, and mode=01777
.
/dev
A devtmpfs
filesystem is mounted here with options nosuid
, strictatime
, size=10M
, and mode=0755
.
(noexec
is not used because old versions of programs such as /sbin/v86d
memory map devices such as /dev/zero
with PROT_EXEC
access for no good reason.
The newer versions of such programs were fixed in the first decade of the 21st century.)
/dev/pts
A devpts
filesystem is mounted here with options noexec
, nosuid
, ptmxmode=0666
, gid=tty
, newinstance
, and mode=0620
.
tty
is currently hardwired to 5
, because the library functions for reading the system account database require dynamic link library and network functionality that are inappropriate for process #1.
/dev/ptmx
This is symbolically linked to /dev/pts/ptmx
to take advantage of the fact that the devpts
filesystem nowadays provides a ptmx
device node that is guaranteed correct for its own set of PTY devices.
With this, obtaining PTYs will work correctly even in a container.
/dev/fd
This is symbolically linked to /proc/self/fd
for compatibility with BSD programs that expect a single /dev/fd
tree for the current process.
/dev/core
This is symbolically linked to /proc/kcore
.
/dev/stdin
, /dev/stdout
, and /dev/stderr
These are symbolically linked to /proc/self/fd/0
, /proc/self/fd/1
, and /proc/self/fd/2
, respectively.
/dev/shm
This is symbolically linked to /run/shm
for compatibility with C/C++ libraries.
/sys/fs/cgroup
A tmpfs
filesystem is mounted here with options size=1M
, and mode=0755
.
This is so that subdirectories for actual (version 1) control group hierarchies can be created here as further mount points.
(With version 2 control groups, this would be the root of a single hierarchy.)
/sys/fs/cgroup/systemd
A cgroup
filesystem is mounted here with options name=systemd
and none
.
This sets up the root of a version 1 control group hierarchy that other toolsets' tools will understand.
(With version 2 control groups, this would be at /sys/fs/cgroup
and have no name
parameter.)
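Part of the list above can be expressed as a table. The sketch below covers only a few of the entries and renders each as an equivalent mount(8) command line, purely for illustration: system-manager is not described as running mount(8), the names are hypothetical, and using the filesystem type as the source argument is an assumption.

```python
# (mount point, filesystem type, options), as given in the list above.
API_MOUNTS = [
    ("/proc", "proc", ["nodev", "nosuid"]),
    ("/sys", "sysfs", ["nodev", "noexec", "nosuid"]),
    ("/run", "tmpfs",
     ["nodev", "nosuid", "strictatime", "size=20%", "mode=0755"]),
    ("/run/shm", "tmpfs",
     ["nodev", "noexec", "nosuid", "strictatime", "size=50%", "mode=01777"]),
    ("/dev", "devtmpfs", ["nosuid", "strictatime", "size=10M", "mode=0755"]),
]

# (link name, link target), as given in the list above.
API_SYMLINKS = [
    ("/dev/ptmx", "/dev/pts/ptmx"),
    ("/dev/fd", "/proc/self/fd"),
    ("/dev/core", "/proc/kcore"),
    ("/dev/stdin", "/proc/self/fd/0"),
    ("/dev/stdout", "/proc/self/fd/1"),
    ("/dev/stderr", "/proc/self/fd/2"),
    ("/dev/shm", "/run/shm"),
]


def mount_argv(target, fstype, options):
    """Render one table entry as an equivalent mount(8) argument vector."""
    return ["mount", "-t", fstype, "-o", ",".join(options), fstype, target]
```

For example, `mount_argv(*API_MOUNTS[0])` yields the argument vector for `mount -t proc -o nodev,nosuid proc /proc`. Order matters in the table: /run must be mounted before /run/shm, and /dev before anything linked or mounted beneath it.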
/proc
A procfs
filesystem is mounted here with options nosuid
.
/run
A tmpfs
filesystem is mounted here with options nosuid
and size=20%
.
/run/shm
A tmpfs
filesystem is mounted here with options nosuid
and size=50%
.
/dev
A devfs
filesystem is mounted here with options nosuid
.
/dev/fd
A fdescfs
filesystem is mounted here with options nosuid
.
/dev/shm
This is symbolically linked to /run/shm
for compatibility with C/C++ libraries.
When the system starts process #1, the operating system kernel's system clock will have been initialized from a hardware real-time clock. On an all-BSD/Linux system, that hardware real-time clock will be running in UTC, and the system clock will thus be initially set to a proper UTC value.
On a more heterogeneous system, the hardware real-time clock may be mistakenly running in a local time. This usually leads to some program or other, during the bootstrap process, having to determine the offset between RTC local time and UTC and correct the system clock. A consequence of this is that the system clock jumps by hours partway through the system bootstrap. In particular, system time leaps backwards for machines whose RTC local time is ahead of UTC, which is not something that POSIX programs are written to expect.
Furthermore, the operating system tries to do silly things with FAT volumes. Instead of just taking file and directory timestamps to be UTC, the filesystem driver by default takes the timestamps to be local time, so needs to know how to convert FAT local time (as on disc) to UTC (as seen at the system call interface with stat(2) and so forth). Because this is done in-kernel, a simplistic and hence broken mechanism is used. A single offset between FAT local time and UTC is applied to all timestamps (either system-wide or per-volume).
fsck(8) also needs to have the correct system time and the local time offset to hand. Otherwise, it miscalculates timestamps on FAT volumes, and compares the wrong system time against the superblock's "last checked" timestamps on EXT and other volumes. This means that the FAT local time offset must be provided to the kernel before any fsck(8) is run by the system bootstrap.
system-manager performs adjustments to the system clock and supplies the kernel with the one offset from FAT/RTC local time to UTC before the service manager or logger are started up, to ensure that time running backwards happens at a predictable point during the system bootstrap, and that it happens before any filesystem checks can run.
When the hardware clock is mistakenly running in local time, the system clock is initialized to the wrong value, since the kernel is always expecting to read UTC from the hardware clock at that point. A special once-only variant of the settimeofday(2) system call both shifts back to UTC and sets the FAT time offset. Normally settimeofday(2) just sets the FAT time offset.
So as part of system initialization, system-manager calls the once-only special variant of settimeofday(2) to set the system time back to UTC and to provide the RTC-local-time-to-UTC and FAT-local-time-to-UTC offsets.
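The arithmetic of that once-only correction can be worked through in a few lines. This sketch only models the numbers and makes no system calls; it assumes the behaviour described in the Linux settimeofday(2) manual, where the kernel adds the supplied minutes-west value, times sixty, to the system clock. The function names are hypothetical.

```python
def minutes_west(utc_offset_seconds):
    """Minutes west of Greenwich for a zone at UTC + utc_offset_seconds."""
    return -(utc_offset_seconds // 60)


def warped_clock(system_clock, utc_offset_seconds):
    """System clock value after the once-only correction.

    system_clock is the value that the kernel derived by wrongly reading a
    local-time hardware clock as if it were UTC.
    """
    return system_clock + minutes_west(utc_offset_seconds) * 60


# Worked example: a zone two hours ahead of UTC.
true_utc = 1_700_000_000
wrong_clock = true_utc + 2 * 3600          # RTC local time misread as UTC
corrected = warped_clock(wrong_clock, 2 * 3600)
```

Here the minutes-west value is -120, so the clock steps back by 7200 seconds and `corrected` equals `true_utc`. This is the backwards leap that the text above says must happen at a predictable point, before any filesystem checks run.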
BSDs have a machdep.adjkerntz
variable and a machdep.wall_cmos_clock
variable (see sysctl(1)) that can be set by the kernel loader from loader.conf(5). These supply the offset between local time, as on FAT volumes and as in the hardware clock, and UTC.
Since they can be set before the kernel first sets the system clock (with inittodr(9)) and thus the local time offset can be applied from the get-go when first transferring from the hardware clock to the system clock, there is no requirement to step the system clock later in the bootstrap process.
However, they are often not set in loader.conf(5), and the system clock is initialized incorrectly.
So as part of system initialization, system-manager calculates machdep.adjkerntz
and machdep.wall_cmos_clock
(the latter from the existence of /etc/wall_cmos_clock
and the former from the timezone database), updates them, and changes the system clock with settimeofday(2).
The signal numbers should be uniform across BSD and Linux. They aren't because of the BSD shutdown(8) command, which sends signals directly to process #1, meaning that system-manager has to align with whatever signals it sends.
Because /usr/lib
and its ilk aren't necessarily present at mount time, the system-manager program image file is statically linked and also incorporates (copies of) the service-manager(1), system-control(1), and cyclog(1) commands as built-in commands.