rebase: update replaced k8s.io modules to v0.33.0

Signed-off-by: Niels de Vos <ndevos@ibm.com>
This commit is contained in:
Niels de Vos
2025-05-07 13:13:33 +02:00
committed by mergify[bot]
parent dd77e72800
commit 107407b44b
1723 changed files with 65035 additions and 175239 deletions

View File

@ -0,0 +1 @@
* @maintainer1 @maintainer2 @maintainer3

View File

@ -0,0 +1,150 @@
# Contribution Guidelines
Development happens on GitHub.
Issues are used for bugs and actionable items and longer discussions can happen on the [mailing list](#mailing-list).
The content of this repository is licensed under the [Apache License, Version 2.0](LICENSE).
## Code of Conduct
Participation in the Open Container community is governed by [Open Container Code of Conduct][code-of-conduct].
## Meetings
The contributors and maintainers of all OCI projects have monthly meetings at 2:00 PM (USA Pacific) on the first Wednesday of every month.
There is an [iCalendar][rfc5545] format for the meetings [here][meeting.ics].
Everyone is welcome to participate via [UberConference web][UberConference] or audio-only: +1 415 968 0849 (no PIN needed).
An initial agenda will be posted to the [mailing list](#mailing-list) in the week before each meeting, and everyone is welcome to propose additional topics or suggest other agenda alterations there.
Minutes from past meetings are archived [here][minutes].
## Mailing list
You can subscribe and browse the mailing list on [Google Groups][mailing-list].
## IRC
OCI discussion happens on #opencontainers on [Freenode][] ([logs][irc-logs]).
## Git
### Security issues
If you are reporting a security issue, do not create an issue or file a pull
request on GitHub. Instead, disclose the issue responsibly by sending an email
to security@opencontainers.org (which is inhabited only by the maintainers of
the various OCI projects).
### Pull requests are always welcome
We are always thrilled to receive pull requests, and do our best to
process them as fast as possible. Not sure if that typo is worth a pull
request? Do it! We will appreciate it.
If your pull request is not accepted on the first try, don't be
discouraged! If there's a problem with the implementation, hopefully you
received feedback on what to improve.
We're trying very hard to keep the project lean and focused. We don't want it
to do everything for everybody. This means that we might decide against
incorporating a new feature.
### Conventions
Fork the repo and make changes on your fork in a feature branch.
For larger bugs and enhancements, consider filing a leader issue or mailing-list thread for discussion that is independent of the implementation.
Small changes or changes that have been discussed on the [project mailing list](#mailing-list) may be submitted without a leader issue.
If the project has a test suite, submit unit tests for your changes. Take a
look at existing tests for inspiration. Run the full test suite on your branch
before submitting a pull request.
Update the documentation when creating or modifying features. Test
your documentation changes for clarity, concision, and correctness, as
well as a clean documentation build.
Pull requests descriptions should be as clear as possible and include a
reference to all the issues that they address.
Commit messages must start with a capitalized and short summary
written in the imperative, followed by an optional, more detailed
explanatory text which is separated from the summary by an empty line.
Code review comments may be added to your pull request. Discuss, then make the
suggested modifications and push additional commits to your feature branch. Be
sure to post a comment after pushing. The new commits will show up in the pull
request automatically, but the reviewers will not be notified unless you
comment.
Before the pull request is merged, make sure that you squash your commits into
logical units of work using `git rebase -i` and `git push -f`. After every
commit the test suite (if any) should be passing. Include documentation changes
in the same commit so that a revert would remove all traces of the feature or
fix.
Commits that fix or close an issue should include a reference like `Closes #XXX`
or `Fixes #XXX`, which will automatically close the issue when merged.
### Sign your work
The sign-off is a simple line at the end of the explanation for the
patch, which certifies that you wrote it or otherwise have the right to
pass it on as an open-source patch. The rules are pretty simple: if you
can certify the below (from [developercertificate.org][]):
```
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
```
then you just add a line to every git commit message:
Signed-off-by: Joe Smith <joe@gmail.com>
using your real name (sorry, no pseudonyms or anonymous contributions.)
You can add the sign off when creating the git commit via `git commit -s`.
[code-of-conduct]: https://github.com/opencontainers/tob/blob/d2f9d68c1332870e40693fe077d311e0742bc73d/code-of-conduct.md
[developercertificate.org]: http://developercertificate.org/
[Freenode]: https://freenode.net/
[irc-logs]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/
[mailing-list]: https://groups.google.com/a/opencontainers.org/forum/#!forum/dev
[meeting.ics]: https://github.com/opencontainers/runtime-spec/blob/master/meeting.ics
[minutes]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/
[rfc5545]: https://tools.ietf.org/html/rfc5545
[UberConference]: https://www.uberconference.com/opencontainers

View File

@ -0,0 +1,63 @@
# Project governance
The [OCI charter][charter] §5.b.viii tasks an OCI Project's maintainers (listed in the repository's MAINTAINERS file and sometimes referred to as "the TDC", [§5.e][charter]) with:
> Creating, maintaining and enforcing governance guidelines for the TDC, approved by the maintainers, and which shall be posted visibly for the TDC.
This section describes generic rules and procedures for fulfilling that mandate.
## Proposing a motion
A maintainer SHOULD propose a motion on the dev@opencontainers.org mailing list (except [security issues](#security-issues)) with another maintainer as a co-sponsor.
## Voting
Voting on a proposed motion SHOULD happen on the dev@opencontainers.org mailing list (except [security issues](#security-issues)) with maintainers posting LGTM or REJECT.
Maintainers MAY also explicitly not vote by posting ABSTAIN (which is useful to revert a previous vote).
Maintainers MAY post multiple times (e.g. as they revise their position based on feedback), but only their final post counts in the tally.
A proposed motion is adopted if two-thirds of votes cast, a quorum having voted, are in favor of the release.
Voting SHOULD remain open for a week to collect feedback from the wider community and allow the maintainers to digest the proposed motion.
Under exceptional conditions (e.g. non-major security fix releases) proposals which reach quorum with unanimous support MAY be adopted earlier.
A maintainer MAY choose to reply with REJECT.
A maintainer posting a REJECT MUST include a list of concerns or links to written documentation for those concerns (e.g. GitHub issues or mailing-list threads).
The maintainers SHOULD try to resolve the concerns and wait for the rejecting maintainer to change their opinion to LGTM.
However, a motion MAY be adopted with REJECTs, as outlined in the previous paragraphs.
## Quorum
A quorum is established when at least two-thirds of maintainers have voted.
For projects that are not specifications, a [motion to release](#release-approval) MAY be adopted if the tally is at least three LGTMs and no REJECTs, even if three votes does not meet the usual two-thirds quorum.
## Amendments
The [project governance](#project-governance) rules and procedures MAY be amended or replaced using the procedures themselves.
The MAINTAINERS of this project governance document is the total set of MAINTAINERS from all Open Containers projects (go-digest, image-spec, image-tools, runC, runtime-spec, runtime-tools, and selinux).
## Subject templates
Maintainers are busy and get lots of email.
To make project proposals recognizable, proposed motions SHOULD use the following subject templates.
### Proposing a motion
> [{project} VOTE]: {motion description} (closes {end of voting window})
For example:
> [runtime-spec VOTE]: Tag 0647920 as 1.0.0-rc (closes 2016-06-03 20:00 UTC)
### Tallying results
After voting closes, a maintainer SHOULD post a tally to the motion thread with a subject template like:
> [{project} {status}]: {motion description} (+{LGTMs} -{REJECTs} #{ABSTAINs})
Where `{status}` is either `adopted` or `rejected`.
For example:
> [runtime-spec adopted]: Tag 0647920 as 1.0.0-rc (+6 -0 #3)
[charter]: https://www.opencontainers.org/about/governance

201
e2e/vendor/github.com/opencontainers/cgroups/LICENSE generated vendored Normal file
View File

@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

View File

@ -0,0 +1,8 @@
Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp> (@AkihiroSuda)
Aleksa Sarai <cyphar@cyphar.com> (@cyphar)
Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin)
Mrunal Patel <mpatel@redhat.com> (@mrunalp)
Sebastiaan van Stijn <github@gone.nl> (@thaJeztah)
Odin Ugedal <odin@uged.al> (@odinuge)
Peter Hunt <pehunt@redhat.com> (@haircommander)
Davanum Srinivas <davanum@gmail.com> (@dims)

View File

@ -0,0 +1,92 @@
## Introduction
Dear maintainer. Thank you for investing the time and energy to help
make this project as useful as possible. Maintaining a project is difficult,
sometimes unrewarding work. Sure, you will get to contribute cool
features to the project. But most of your time will be spent reviewing,
cleaning up, documenting, answering questions, justifying design
decisions - while everyone has all the fun! But remember - the quality
of the maintainers work is what distinguishes the good projects from the
great. So please be proud of your work, even the unglamourous parts,
and encourage a culture of appreciation and respect for *every* aspect
of improving the project - not just the hot new features.
This document is a manual for maintainers old and new. It explains what
is expected of maintainers, how they should work, and what tools are
available to them.
This is a living document - if you see something out of date or missing,
speak up!
## What are a maintainer's responsibilities?
It is every maintainer's responsibility to:
* Expose a clear roadmap for improving their component.
* Deliver prompt feedback and decisions on pull requests.
* Be available to anyone with questions, bug reports, criticism etc. on their component.
This includes IRC and GitHub issues and pull requests.
* Make sure their component respects the philosophy, design and roadmap of the project.
## How are decisions made?
This project is an open-source project with an open design philosophy. This
means that the repository is the source of truth for EVERY aspect of the
project, including its philosophy, design, roadmap and APIs. *If it's
part of the project, it's in the repo. It's in the repo, it's part of
the project.*
As a result, all decisions can be expressed as changes to the
repository. An implementation change is a change to the source code. An
API change is a change to the API specification. A philosophy change is
a change to the philosophy manifesto. And so on.
All decisions affecting this project, big and small, follow the same procedure:
1. Discuss a proposal on the [mailing list](CONTRIBUTING.md#mailing-list).
Anyone can do this.
2. Open a pull request.
Anyone can do this.
3. Discuss the pull request.
Anyone can do this.
4. Endorse (`LGTM`) or oppose (`Rejected`) the pull request.
The relevant maintainers do this (see below [Who decides what?](#who-decides-what)).
Changes that affect project management (changing policy, cutting releases, etc.) are [proposed and voted on the mailing list](GOVERNANCE.md).
5. Merge or close the pull request.
The relevant maintainers do this.
### I'm a maintainer, should I make pull requests too?
Yes. Nobody should ever push to master directly. All changes should be
made through a pull request.
## Who decides what?
All decisions are pull requests, and the relevant maintainers make
decisions by accepting or refusing the pull request. Review and acceptance
by anyone is denoted by adding a comment in the pull request: `LGTM`.
However, only currently listed `MAINTAINERS` are counted towards the required
two LGTMs. In addition, if a maintainer has created a pull request, they cannot
count toward the two LGTM rule (to ensure equal amounts of review for every pull
request, no matter who wrote it).
Overall the maintainer system works because of mutual respect.
The maintainers trust one another to act in the best interests of the project.
Sometimes maintainers can disagree and this is part of a healthy project to represent the points of view of various people.
In the case where maintainers cannot find agreement on a specific change, maintainers should use the [governance procedure](GOVERNANCE.md) to attempt to reach a consensus.
### How are maintainers added?
The best maintainers have a vested interest in the project. Maintainers
are first and foremost contributors that have shown they are committed to
the long term success of the project. Contributors wanting to become
maintainers are expected to be deeply involved in contributing code,
pull request review, and triage of issues in the project for more than two months.
Just contributing does not make you a maintainer, it is about building trust with the current maintainers of the project and being a person that they can depend on to act in the best interest of the project.
The final vote to add a new maintainer should be approved by the [governance procedure](GOVERNANCE.md).
### How are maintainers removed?
When a maintainer is unable to perform the [required duties](#what-are-a-maintainers-responsibilities) they can be removed by the [governance procedure](GOVERNANCE.md).
Issues related to a maintainer's performance should be discussed with them among the other maintainers so that they are not surprised by a pull request removing them.

11
e2e/vendor/github.com/opencontainers/cgroups/README.md generated vendored Normal file
View File

@ -0,0 +1,11 @@
# OCI Project Template
Useful boilerplate and organizational information for all OCI projects.
* README (this file)
* [The Apache License, Version 2.0](LICENSE)
* [A list of maintainers](MAINTAINERS)
* [Maintainer guidelines](MAINTAINERS_GUIDE.md)
* [Contributor guidelines](CONTRIBUTING.md)
* [Project governance](GOVERNANCE.md)
* [Release procedures](RELEASES.md)

View File

@ -0,0 +1,51 @@
# Releases
The release process hopes to encourage early, consistent consensus-building during project development.
The mechanisms used are regular community communication on the mailing list about progress, scheduled meetings for issue resolution and release triage, and regularly paced and communicated releases.
Releases are proposed and adopted or rejected using the usual [project governance](GOVERNANCE.md) rules and procedures.
An anti-pattern that we want to avoid is heavy development or discussions "late cycle" around major releases.
We want to build a community that is involved and communicates consistently through all releases instead of relying on "silent periods" as a judge of stability.
## Parallel releases
A single project MAY consider several motions to release in parallel.
However each motion to release after the initial 0.1.0 MUST be based on a previous release that has already landed.
For example, runtime-spec maintainers may propose a v1.0.0-rc2 on the 1st of the month and a v0.9.1 bugfix on the 2nd of the month.
They may not propose a v1.0.0-rc3 until the v1.0.0-rc2 is accepted (on the 7th if the vote initiated on the 1st passes).
## Specifications
The OCI maintains three categories of projects: specifications, applications, and conformance-testing tools.
However, specification releases have special restrictions in the [OCI charter][charter]:
* They are the target of backwards compatibility (§7.g), and
* They are subject to the OFWa patent grant (§8.d and e).
To avoid unfortunate side effects (onerous backwards compatibity requirements or Member resignations), the following additional procedures apply to specification releases:
### Planning a release
Every OCI specification project SHOULD hold meetings that involve maintainers reviewing pull requests, debating outstanding issues, and planning releases.
This meeting MUST be advertised on the project README and MAY happen on a phone call, video conference, or on IRC.
Maintainers MUST send updates to the dev@opencontainers.org with results of these meetings.
Before the specification reaches v1.0.0, the meetings SHOULD be weekly.
Once a specification has reached v1.0.0, the maintainers may alter the cadence, but a meeting MUST be held within four weeks of the previous meeting.
The release plans, corresponding milestones and estimated due dates MUST be published on GitHub (e.g. https://github.com/opencontainers/runtime-spec/milestones).
GitHub milestones and issues are only used for community organization and all releases MUST follow the [project governance](GOVERNANCE.md) rules and procedures.
### Timelines
Specifications have a variety of different timelines in their lifecycle.
* Pre-v1.0.0 specifications SHOULD release on a monthly cadence to garner feedback.
* Major specification releases MUST release at least three release candidates spaced a minimum of one week apart.
This means a major release like a v1.0.0 or v2.0.0 release will take 1 month at minimum: one week for rc1, one week for rc2, one week for rc3, and one week for the major release itself.
Maintainers SHOULD strive to make zero breaking changes during this cycle of release candidates and SHOULD restart the three-candidate count when a breaking change is introduced.
For example if a breaking change is introduced in v1.0.0-rc2 then the series would end with v1.0.0-rc4 and v1.0.0.
* Minor and patch releases SHOULD be made on an as-needed basis.
[charter]: https://www.opencontainers.org/about/governance

View File

@ -2,8 +2,6 @@ package cgroups
import (
"errors"
"github.com/opencontainers/runc/libcontainer/configs"
)
var (
@ -18,11 +16,11 @@ var (
// DevicesSetV1 and DevicesSetV2 are functions to set devices for
// cgroup v1 and v2, respectively. Unless
// [github.com/opencontainers/runc/libcontainer/cgroups/devices]
// [github.com/opencontainers/cgroups/devices]
// package is imported, it is set to nil, so cgroup managers can't
// manage devices.
DevicesSetV1 func(path string, r *configs.Resources) error
DevicesSetV2 func(path string, r *configs.Resources) error
DevicesSetV1 func(path string, r *Resources) error
DevicesSetV2 func(path string, r *Resources) error
)
type Manager interface {
@ -42,7 +40,7 @@ type Manager interface {
GetStats() (*Stats, error)
// Freeze sets the freezer cgroup to the specified state.
Freeze(state configs.FreezerState) error
Freeze(state FreezerState) error
// Destroy removes cgroup.
Destroy() error
@ -54,7 +52,7 @@ type Manager interface {
// Set sets cgroup resources parameters/limits. If the argument is nil,
// the resources specified during Manager creation (or the previous call
// to Set) are used.
Set(r *configs.Resources) error
Set(r *Resources) error
// GetPaths returns cgroup path(s) to save in a state file in order to
// restore later.
@ -67,10 +65,10 @@ type Manager interface {
GetPaths() map[string]string
// GetCgroups returns the cgroup data as configured.
GetCgroups() (*configs.Cgroup, error)
GetCgroups() (*Cgroup, error)
// GetFreezerState retrieves the current FreezerState of the cgroup.
GetFreezerState() (configs.FreezerState, error)
GetFreezerState() (FreezerState, error)
// Exists returns whether the cgroup path exists or not.
Exists() bool

View File

@ -1,4 +1,4 @@
package configs
package cgroups
type HugepageLimit struct {
// which type of hugepage to limit.

View File

@ -1,8 +1,8 @@
package configs
package cgroups
import (
systemdDbus "github.com/coreos/go-systemd/v22/dbus"
"github.com/opencontainers/runc/libcontainer/devices"
devices "github.com/opencontainers/cgroups/devices/config"
)
type FreezerState string

View File

@ -1,4 +1,4 @@
package configs
package cgroups
// LinuxRdma for Linux cgroup 'rdma' resource management (Linux 4.11)
type LinuxRdma struct {

View File

@ -1,6 +1,6 @@
//go:build !linux
package configs
package cgroups
// Cgroup holds properties of a cgroup on Linux
// TODO Windows: This can ultimately be entirely factored out on Windows as

View File

@ -0,0 +1,14 @@
package config
import (
"errors"
"golang.org/x/sys/unix"
)
func mkDev(d *Rule) (uint64, error) {
if d.Major == Wildcard || d.Minor == Wildcard {
return 0, errors.New("cannot mkdev() device with wildcards")
}
return unix.Mkdev(uint32(d.Major), uint32(d.Minor)), nil
}

View File

@ -5,12 +5,11 @@ import (
"errors"
"fmt"
"os"
"path"
"path/filepath"
"strconv"
"strings"
"sync"
"github.com/opencontainers/runc/libcontainer/utils"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)
@ -147,12 +146,15 @@ func openFile(dir, file string, flags int) (*os.File, error) {
flags |= os.O_TRUNC | os.O_CREATE
mode = 0o600
}
path := path.Join(dir, utils.CleanPath(file))
// NOTE it is important to use filepath.Clean("/"+file) here
// (see https://github.com/opencontainers/runc/issues/4103)!
path := filepath.Join(dir, filepath.Clean("/"+file))
if prepareOpenat2() != nil {
return openFallback(path, flags, mode)
}
relPath := strings.TrimPrefix(path, cgroupfsPrefix)
if len(relPath) == len(path) { // non-standard path, old system?
relPath, ok := strings.CutPrefix(path, cgroupfsPrefix)
if !ok { // Non-standard path, old system?
return openFallback(path, flags, mode)
}
@ -171,10 +173,8 @@ func openFile(dir, file string, flags int) (*os.File, error) {
//
// TODO: if such usage will ever be common, amend this
// to reopen cgroupRootHandle and retry openat2.
fdPath, closer := utils.ProcThreadSelf("fd/" + strconv.Itoa(int(cgroupRootHandle.Fd())))
defer closer()
fdDest, _ := os.Readlink(fdPath)
if fdDest != cgroupfsDir {
fdDest, fdErr := os.Readlink("/proc/thread-self/fd/" + strconv.Itoa(int(cgroupRootHandle.Fd())))
if fdErr == nil && fdDest != cgroupfsDir {
// Wrap the error so it is clear that cgroupRootHandle
// is opened to an unexpected/wrong directory.
err = fmt.Errorf("cgroupRootHandle %d unexpectedly opened to %s != %s: %w",

View File

@ -7,8 +7,7 @@ import (
"strconv"
"strings"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
type BlkioGroup struct {
@ -20,11 +19,11 @@ func (s *BlkioGroup) Name() string {
return "blkio"
}
func (s *BlkioGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *BlkioGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *BlkioGroup) Set(path string, r *configs.Resources) error {
func (s *BlkioGroup) Set(path string, r *cgroups.Resources) error {
s.detectWeightFilenames(path)
if r.BlkioWeight != 0 {
if err := cgroups.WriteFile(path, s.weightFilename, strconv.FormatUint(uint64(r.BlkioWeight), 10)); err != nil {

View File

@ -7,9 +7,8 @@ import (
"os"
"strconv"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
"golang.org/x/sys/unix"
)
@ -19,7 +18,7 @@ func (s *CpuGroup) Name() string {
return "cpu"
}
func (s *CpuGroup) Apply(path string, r *configs.Resources, pid int) error {
func (s *CpuGroup) Apply(path string, r *cgroups.Resources, pid int) error {
if err := os.MkdirAll(path, 0o755); err != nil {
return err
}
@ -34,7 +33,7 @@ func (s *CpuGroup) Apply(path string, r *configs.Resources, pid int) error {
return cgroups.WriteCgroupProc(path, pid)
}
func (s *CpuGroup) SetRtSched(path string, r *configs.Resources) error {
func (s *CpuGroup) SetRtSched(path string, r *cgroups.Resources) error {
var period string
if r.CpuRtPeriod != 0 {
period = strconv.FormatUint(r.CpuRtPeriod, 10)
@ -64,7 +63,7 @@ func (s *CpuGroup) SetRtSched(path string, r *configs.Resources) error {
return nil
}
func (s *CpuGroup) Set(path string, r *configs.Resources) error {
func (s *CpuGroup) Set(path string, r *cgroups.Resources) error {
if r.CpuShares != 0 {
shares := r.CpuShares
if err := cgroups.WriteFile(path, "cpu.shares", strconv.FormatUint(shares, 10)); err != nil {

View File

@ -6,20 +6,12 @@ import (
"strconv"
"strings"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
const (
cgroupCpuacctStat = "cpuacct.stat"
cgroupCpuacctUsageAll = "cpuacct.usage_all"
nanosecondsInSecond = 1000000000
userModeColumn = 1
kernelModeColumn = 2
cuacctUsageAllColumnsNumber = 3
nsInSec = 1000000000
// The value comes from `C.sysconf(C._SC_CLK_TCK)`, and
// on Linux it's a constant which is safe to be hard coded,
@ -34,11 +26,11 @@ func (s *CpuacctGroup) Name() string {
return "cpuacct"
}
func (s *CpuacctGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *CpuacctGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *CpuacctGroup) Set(_ string, _ *configs.Resources) error {
func (s *CpuacctGroup) Set(_ string, _ *cgroups.Resources) error {
return nil
}
@ -81,7 +73,7 @@ func getCpuUsageBreakdown(path string) (uint64, uint64, error) {
const (
userField = "user"
systemField = "system"
file = cgroupCpuacctStat
file = "cpuacct.stat"
)
// Expected format:
@ -103,7 +95,7 @@ func getCpuUsageBreakdown(path string) (uint64, uint64, error) {
return 0, 0, &parseError{Path: path, File: file, Err: err}
}
return (userModeUsage * nanosecondsInSecond) / clockTicks, (kernelModeUsage * nanosecondsInSecond) / clockTicks, nil
return (userModeUsage * nsInSec) / clockTicks, (kernelModeUsage * nsInSec) / clockTicks, nil
}
func getPercpuUsage(path string) ([]uint64, error) {
@ -113,7 +105,6 @@ func getPercpuUsage(path string) ([]uint64, error) {
if err != nil {
return percpuUsage, err
}
// TODO: use strings.SplitN instead.
for _, value := range strings.Fields(data) {
value, err := strconv.ParseUint(value, 10, 64)
if err != nil {
@ -127,7 +118,7 @@ func getPercpuUsage(path string) ([]uint64, error) {
func getPercpuUsageInModes(path string) ([]uint64, []uint64, error) {
usageKernelMode := []uint64{}
usageUserMode := []uint64{}
const file = cgroupCpuacctUsageAll
const file = "cpuacct.usage_all"
fd, err := cgroups.OpenFile(path, file, os.O_RDONLY)
if os.IsNotExist(err) {
@ -141,22 +132,23 @@ func getPercpuUsageInModes(path string) ([]uint64, []uint64, error) {
scanner.Scan() // skipping header line
for scanner.Scan() {
lineFields := strings.SplitN(scanner.Text(), " ", cuacctUsageAllColumnsNumber+1)
if len(lineFields) != cuacctUsageAllColumnsNumber {
// Each line is: cpu user system
fields := strings.SplitN(scanner.Text(), " ", 3)
if len(fields) != 3 {
continue
}
usageInKernelMode, err := strconv.ParseUint(lineFields[kernelModeColumn], 10, 64)
user, err := strconv.ParseUint(fields[1], 10, 64)
if err != nil {
return nil, nil, &parseError{Path: path, File: file, Err: err}
}
usageKernelMode = append(usageKernelMode, usageInKernelMode)
usageUserMode = append(usageUserMode, user)
usageInUserMode, err := strconv.ParseUint(lineFields[userModeColumn], 10, 64)
kernel, err := strconv.ParseUint(fields[2], 10, 64)
if err != nil {
return nil, nil, &parseError{Path: path, File: file, Err: err}
}
usageUserMode = append(usageUserMode, usageInUserMode)
usageKernelMode = append(usageKernelMode, kernel)
}
if err := scanner.Err(); err != nil {
return nil, nil, &parseError{Path: path, File: file, Err: err}

View File

@ -6,32 +6,66 @@ import (
"path/filepath"
"strconv"
"strings"
"sync"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
var (
cpusetLock sync.Mutex
cpusetPrefix = "cpuset."
cpusetFastPath bool
)
func cpusetFile(path string, name string) string {
cpusetLock.Lock()
defer cpusetLock.Unlock()
// Only the v1 cpuset cgroup is allowed to mount with noprefix.
// See kernel source: https://github.com/torvalds/linux/blob/2e1b3cc9d7f790145a80cb705b168f05dab65df2/kernel/cgroup/cgroup-v1.c#L1070
// Cpuset cannot be mounted with and without prefix simultaneously.
// Commonly used in Android environments.
if cpusetFastPath {
return cpusetPrefix + name
}
err := unix.Access(filepath.Join(path, cpusetPrefix+name), unix.F_OK)
if err == nil {
// Use the fast path only if we can access one type of mount for cpuset already
cpusetFastPath = true
} else {
err = unix.Access(filepath.Join(path, name), unix.F_OK)
if err == nil {
cpusetPrefix = ""
cpusetFastPath = true
}
}
return cpusetPrefix + name
}
type CpusetGroup struct{}
func (s *CpusetGroup) Name() string {
return "cpuset"
}
func (s *CpusetGroup) Apply(path string, r *configs.Resources, pid int) error {
func (s *CpusetGroup) Apply(path string, r *cgroups.Resources, pid int) error {
return s.ApplyDir(path, r, pid)
}
func (s *CpusetGroup) Set(path string, r *configs.Resources) error {
func (s *CpusetGroup) Set(path string, r *cgroups.Resources) error {
if r.CpusetCpus != "" {
if err := cgroups.WriteFile(path, "cpuset.cpus", r.CpusetCpus); err != nil {
if err := cgroups.WriteFile(path, cpusetFile(path, "cpus"), r.CpusetCpus); err != nil {
return err
}
}
if r.CpusetMems != "" {
if err := cgroups.WriteFile(path, "cpuset.mems", r.CpusetMems); err != nil {
if err := cgroups.WriteFile(path, cpusetFile(path, "mems"), r.CpusetMems); err != nil {
return err
}
}
@ -49,26 +83,23 @@ func getCpusetStat(path string, file string) ([]uint16, error) {
}
for _, s := range strings.Split(fileContent, ",") {
sp := strings.SplitN(s, "-", 3)
switch len(sp) {
case 3:
return extracted, &parseError{Path: path, File: file, Err: errors.New("extra dash")}
case 2:
min, err := strconv.ParseUint(sp[0], 10, 16)
fromStr, toStr, ok := strings.Cut(s, "-")
if ok {
from, err := strconv.ParseUint(fromStr, 10, 16)
if err != nil {
return extracted, &parseError{Path: path, File: file, Err: err}
}
max, err := strconv.ParseUint(sp[1], 10, 16)
to, err := strconv.ParseUint(toStr, 10, 16)
if err != nil {
return extracted, &parseError{Path: path, File: file, Err: err}
}
if min > max {
return extracted, &parseError{Path: path, File: file, Err: errors.New("invalid values, min > max")}
if from > to {
return extracted, &parseError{Path: path, File: file, Err: errors.New("invalid values, from > to")}
}
for i := min; i <= max; i++ {
for i := from; i <= to; i++ {
extracted = append(extracted, uint16(i))
}
case 1:
} else {
value, err := strconv.ParseUint(s, 10, 16)
if err != nil {
return extracted, &parseError{Path: path, File: file, Err: err}
@ -83,57 +114,57 @@ func getCpusetStat(path string, file string) ([]uint16, error) {
func (s *CpusetGroup) GetStats(path string, stats *cgroups.Stats) error {
var err error
stats.CPUSetStats.CPUs, err = getCpusetStat(path, "cpuset.cpus")
stats.CPUSetStats.CPUs, err = getCpusetStat(path, cpusetFile(path, "cpus"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.CPUExclusive, err = fscommon.GetCgroupParamUint(path, "cpuset.cpu_exclusive")
stats.CPUSetStats.CPUExclusive, err = fscommon.GetCgroupParamUint(path, cpusetFile(path, "cpu_exclusive"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.Mems, err = getCpusetStat(path, "cpuset.mems")
stats.CPUSetStats.Mems, err = getCpusetStat(path, cpusetFile(path, "mems"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.MemHardwall, err = fscommon.GetCgroupParamUint(path, "cpuset.mem_hardwall")
stats.CPUSetStats.MemHardwall, err = fscommon.GetCgroupParamUint(path, cpusetFile(path, "mem_hardwall"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.MemExclusive, err = fscommon.GetCgroupParamUint(path, "cpuset.mem_exclusive")
stats.CPUSetStats.MemExclusive, err = fscommon.GetCgroupParamUint(path, cpusetFile(path, "mem_exclusive"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.MemoryMigrate, err = fscommon.GetCgroupParamUint(path, "cpuset.memory_migrate")
stats.CPUSetStats.MemoryMigrate, err = fscommon.GetCgroupParamUint(path, cpusetFile(path, "memory_migrate"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.MemorySpreadPage, err = fscommon.GetCgroupParamUint(path, "cpuset.memory_spread_page")
stats.CPUSetStats.MemorySpreadPage, err = fscommon.GetCgroupParamUint(path, cpusetFile(path, "memory_spread_page"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.MemorySpreadSlab, err = fscommon.GetCgroupParamUint(path, "cpuset.memory_spread_slab")
stats.CPUSetStats.MemorySpreadSlab, err = fscommon.GetCgroupParamUint(path, cpusetFile(path, "memory_spread_slab"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.MemoryPressure, err = fscommon.GetCgroupParamUint(path, "cpuset.memory_pressure")
stats.CPUSetStats.MemoryPressure, err = fscommon.GetCgroupParamUint(path, cpusetFile(path, "memory_pressure"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.SchedLoadBalance, err = fscommon.GetCgroupParamUint(path, "cpuset.sched_load_balance")
stats.CPUSetStats.SchedLoadBalance, err = fscommon.GetCgroupParamUint(path, cpusetFile(path, "sched_load_balance"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
stats.CPUSetStats.SchedRelaxDomainLevel, err = fscommon.GetCgroupParamInt(path, "cpuset.sched_relax_domain_level")
stats.CPUSetStats.SchedRelaxDomainLevel, err = fscommon.GetCgroupParamInt(path, cpusetFile(path, "sched_relax_domain_level"))
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
@ -141,7 +172,7 @@ func (s *CpusetGroup) GetStats(path string, stats *cgroups.Stats) error {
return nil
}
func (s *CpusetGroup) ApplyDir(dir string, r *configs.Resources, pid int) error {
func (s *CpusetGroup) ApplyDir(dir string, r *cgroups.Resources, pid int) error {
// This might happen if we have no cpuset cgroup mounted.
// Just do nothing and don't fail.
if dir == "" {
@ -172,10 +203,10 @@ func (s *CpusetGroup) ApplyDir(dir string, r *configs.Resources, pid int) error
}
func getCpusetSubsystemSettings(parent string) (cpus, mems string, err error) {
if cpus, err = cgroups.ReadFile(parent, "cpuset.cpus"); err != nil {
if cpus, err = cgroups.ReadFile(parent, cpusetFile(parent, "cpus")); err != nil {
return
}
if mems, err = cgroups.ReadFile(parent, "cpuset.mems"); err != nil {
if mems, err = cgroups.ReadFile(parent, cpusetFile(parent, "mems")); err != nil {
return
}
return cpus, mems, nil
@ -221,12 +252,12 @@ func cpusetCopyIfNeeded(current, parent string) error {
}
if isEmptyCpuset(currentCpus) {
if err := cgroups.WriteFile(current, "cpuset.cpus", parentCpus); err != nil {
if err := cgroups.WriteFile(current, cpusetFile(current, "cpus"), parentCpus); err != nil {
return err
}
}
if isEmptyCpuset(currentMems) {
if err := cgroups.WriteFile(current, "cpuset.mems", parentMems); err != nil {
if err := cgroups.WriteFile(current, cpusetFile(current, "mems"), parentMems); err != nil {
return err
}
}
@ -237,7 +268,7 @@ func isEmptyCpuset(str string) bool {
return str == "" || str == "\n"
}
func (s *CpusetGroup) ensureCpusAndMems(path string, r *configs.Resources) error {
func (s *CpusetGroup) ensureCpusAndMems(path string, r *cgroups.Resources) error {
if err := s.Set(path, r); err != nil {
return err
}

View File

@ -1,8 +1,7 @@
package fs
import (
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
type DevicesGroup struct{}
@ -11,7 +10,7 @@ func (s *DevicesGroup) Name() string {
return "devices"
}
func (s *DevicesGroup) Apply(path string, r *configs.Resources, pid int) error {
func (s *DevicesGroup) Apply(path string, r *cgroups.Resources, pid int) error {
if r.SkipDevices {
return nil
}
@ -24,7 +23,7 @@ func (s *DevicesGroup) Apply(path string, r *configs.Resources, pid int) error {
return apply(path, pid)
}
func (s *DevicesGroup) Set(path string, r *configs.Resources) error {
func (s *DevicesGroup) Set(path string, r *cgroups.Resources) error {
if cgroups.DevicesSetV1 == nil {
if len(r.Devices) == 0 {
return nil

View File

@ -3,7 +3,7 @@ package fs
import (
"fmt"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/cgroups/fscommon"
)
type parseError = fscommon.ParseError

View File

@ -7,8 +7,7 @@ import (
"strings"
"time"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)
@ -19,19 +18,19 @@ func (s *FreezerGroup) Name() string {
return "freezer"
}
func (s *FreezerGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *FreezerGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *FreezerGroup) Set(path string, r *configs.Resources) (Err error) {
func (s *FreezerGroup) Set(path string, r *cgroups.Resources) (Err error) {
switch r.Freezer {
case configs.Frozen:
case cgroups.Frozen:
defer func() {
if Err != nil {
// Freezing failed, and it is bad and dangerous
// to leave the cgroup in FROZEN or FREEZING
// state, so (try to) thaw it back.
_ = cgroups.WriteFile(path, "freezer.state", string(configs.Thawed))
_ = cgroups.WriteFile(path, "freezer.state", string(cgroups.Thawed))
}
}()
@ -64,11 +63,11 @@ func (s *FreezerGroup) Set(path string, r *configs.Resources) (Err error) {
// the chances to succeed in freezing
// in case new processes keep appearing
// in the cgroup.
_ = cgroups.WriteFile(path, "freezer.state", string(configs.Thawed))
_ = cgroups.WriteFile(path, "freezer.state", string(cgroups.Thawed))
time.Sleep(10 * time.Millisecond)
}
if err := cgroups.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
if err := cgroups.WriteFile(path, "freezer.state", string(cgroups.Frozen)); err != nil {
return err
}
@ -87,7 +86,7 @@ func (s *FreezerGroup) Set(path string, r *configs.Resources) (Err error) {
switch state {
case "FREEZING":
continue
case string(configs.Frozen):
case string(cgroups.Frozen):
if i > 1 {
logrus.Debugf("frozen after %d retries", i)
}
@ -99,9 +98,9 @@ func (s *FreezerGroup) Set(path string, r *configs.Resources) (Err error) {
}
// Despite our best efforts, it got stuck in FREEZING.
return errors.New("unable to freeze")
case configs.Thawed:
return cgroups.WriteFile(path, "freezer.state", string(configs.Thawed))
case configs.Undefined:
case cgroups.Thawed:
return cgroups.WriteFile(path, "freezer.state", string(cgroups.Thawed))
case cgroups.Undefined:
return nil
default:
return fmt.Errorf("Invalid argument '%s' to freezer.state", string(r.Freezer))
@ -112,7 +111,7 @@ func (s *FreezerGroup) GetStats(path string, stats *cgroups.Stats) error {
return nil
}
func (s *FreezerGroup) GetState(path string) (configs.FreezerState, error) {
func (s *FreezerGroup) GetState(path string) (cgroups.FreezerState, error) {
for {
state, err := cgroups.ReadFile(path, "freezer.state")
if err != nil {
@ -121,11 +120,11 @@ func (s *FreezerGroup) GetState(path string) (configs.FreezerState, error) {
if os.IsNotExist(err) || errors.Is(err, unix.ENODEV) {
err = nil
}
return configs.Undefined, err
return cgroups.Undefined, err
}
switch strings.TrimSpace(state) {
case "THAWED":
return configs.Thawed, nil
return cgroups.Thawed, nil
case "FROZEN":
// Find out whether the cgroup is frozen directly,
// or indirectly via an ancestor.
@ -136,15 +135,15 @@ func (s *FreezerGroup) GetState(path string) (configs.FreezerState, error) {
if errors.Is(err, os.ErrNotExist) || errors.Is(err, unix.ENODEV) {
err = nil
}
return configs.Frozen, err
return cgroups.Frozen, err
}
switch self {
case "0\n":
return configs.Thawed, nil
return cgroups.Thawed, nil
case "1\n":
return configs.Frozen, nil
return cgroups.Frozen, nil
default:
return configs.Undefined, fmt.Errorf(`unknown "freezer.self_freezing" state: %q`, self)
return cgroups.Undefined, fmt.Errorf(`unknown "freezer.self_freezing" state: %q`, self)
}
case "FREEZING":
// Make sure we get a stable freezer state, so retry if the cgroup
@ -152,7 +151,7 @@ func (s *FreezerGroup) GetState(path string) (configs.FreezerState, error) {
time.Sleep(1 * time.Millisecond)
continue
default:
return configs.Undefined, fmt.Errorf("unknown freezer.state %q", state)
return cgroups.Undefined, fmt.Errorf("unknown freezer.state %q", state)
}
}
}

View File

@ -8,9 +8,8 @@ import (
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
var subsystems = []subsystem{
@ -49,22 +48,22 @@ type subsystem interface {
// Apply creates and joins a cgroup, adding pid into it. Some
// subsystems use resources to pre-configure the cgroup parents
// before creating or joining it.
Apply(path string, r *configs.Resources, pid int) error
Apply(path string, r *cgroups.Resources, pid int) error
// Set sets the cgroup resources.
Set(path string, r *configs.Resources) error
Set(path string, r *cgroups.Resources) error
}
type Manager struct {
mu sync.Mutex
cgroups *configs.Cgroup
cgroups *cgroups.Cgroup
paths map[string]string
}
func NewManager(cg *configs.Cgroup, paths map[string]string) (*Manager, error) {
func NewManager(cg *cgroups.Cgroup, paths map[string]string) (*Manager, error) {
// Some v1 controllers (cpu, cpuset, and devices) expect
// cgroups.Resources to not be nil in Apply.
if cg.Resources == nil {
return nil, errors.New("cgroup v1 manager needs configs.Resources to be set during manager creation")
return nil, errors.New("cgroup v1 manager needs cgroups.Resources to be set during manager creation")
}
if cg.Resources.Unified != nil {
return nil, cgroups.ErrV1NoUnified
@ -168,7 +167,7 @@ func (m *Manager) GetStats() (*cgroups.Stats, error) {
return stats, nil
}
func (m *Manager) Set(r *configs.Resources) error {
func (m *Manager) Set(r *cgroups.Resources) error {
if r == nil {
return nil
}
@ -203,7 +202,7 @@ func (m *Manager) Set(r *configs.Resources) error {
// Freeze toggles the container's freezer cgroup depending on the state
// provided
func (m *Manager) Freeze(state configs.FreezerState) error {
func (m *Manager) Freeze(state cgroups.FreezerState) error {
path := m.Path("freezer")
if path == "" {
return errors.New("cannot toggle freezer: cgroups not configured for container")
@ -233,15 +232,15 @@ func (m *Manager) GetPaths() map[string]string {
return m.paths
}
func (m *Manager) GetCgroups() (*configs.Cgroup, error) {
func (m *Manager) GetCgroups() (*cgroups.Cgroup, error) {
return m.cgroups, nil
}
func (m *Manager) GetFreezerState() (configs.FreezerState, error) {
func (m *Manager) GetFreezerState() (cgroups.FreezerState, error) {
dir := m.Path("freezer")
// If the container doesn't have the freezer cgroup, say it's undefined.
if dir == "" {
return configs.Undefined, nil
return cgroups.Undefined, nil
}
freezer := &FreezerGroup{}
return freezer.GetState(dir)

View File

@ -5,9 +5,8 @@ import (
"os"
"strconv"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
type HugetlbGroup struct{}
@ -16,11 +15,11 @@ func (s *HugetlbGroup) Name() string {
return "hugetlb"
}
func (s *HugetlbGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *HugetlbGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *HugetlbGroup) Set(path string, r *configs.Resources) error {
func (s *HugetlbGroup) Set(path string, r *cgroups.Resources) error {
const suffix = ".limit_in_bytes"
skipRsvd := false

View File

@ -12,9 +12,8 @@ import (
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
const (
@ -30,7 +29,7 @@ func (s *MemoryGroup) Name() string {
return "memory"
}
func (s *MemoryGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *MemoryGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
@ -66,7 +65,7 @@ func setSwap(path string, val int64) error {
return cgroups.WriteFile(path, cgroupMemorySwapLimit, strconv.FormatInt(val, 10))
}
func setMemoryAndSwap(path string, r *configs.Resources) error {
func setMemoryAndSwap(path string, r *cgroups.Resources) error {
// If the memory update is set to -1 and the swap is not explicitly
// set, we should also set swap to -1, it means unlimited memory.
if r.Memory == -1 && r.MemorySwap == 0 {
@ -108,7 +107,7 @@ func setMemoryAndSwap(path string, r *configs.Resources) error {
return nil
}
func (s *MemoryGroup) Set(path string, r *configs.Resources) error {
func (s *MemoryGroup) Set(path string, r *cgroups.Resources) error {
if err := setMemoryAndSwap(path, r); err != nil {
return err
}

View File

@ -1,8 +1,7 @@
package fs
import (
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
type NameGroup struct {
@ -14,7 +13,7 @@ func (s *NameGroup) Name() string {
return s.GroupName
}
func (s *NameGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *NameGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
if s.Join {
// Ignore errors if the named cgroup does not exist.
_ = apply(path, pid)
@ -22,7 +21,7 @@ func (s *NameGroup) Apply(path string, _ *configs.Resources, pid int) error {
return nil
}
func (s *NameGroup) Set(_ string, _ *configs.Resources) error {
func (s *NameGroup) Set(_ string, _ *cgroups.Resources) error {
return nil
}

View File

@ -3,8 +3,7 @@ package fs
import (
"strconv"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
type NetClsGroup struct{}
@ -13,11 +12,11 @@ func (s *NetClsGroup) Name() string {
return "net_cls"
}
func (s *NetClsGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *NetClsGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *NetClsGroup) Set(path string, r *configs.Resources) error {
func (s *NetClsGroup) Set(path string, r *cgroups.Resources) error {
if r.NetClsClassid != 0 {
if err := cgroups.WriteFile(path, "net_cls.classid", strconv.FormatUint(uint64(r.NetClsClassid), 10)); err != nil {
return err

View File

@ -1,8 +1,7 @@
package fs
import (
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
type NetPrioGroup struct{}
@ -11,11 +10,11 @@ func (s *NetPrioGroup) Name() string {
return "net_prio"
}
func (s *NetPrioGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *NetPrioGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *NetPrioGroup) Set(path string, r *configs.Resources) error {
func (s *NetPrioGroup) Set(path string, r *cgroups.Resources) error {
for _, prioMap := range r.NetPrioIfpriomap {
if err := cgroups.WriteFile(path, "net_prio.ifpriomap", prioMap.CgroupString()); err != nil {
return err

View File

@ -8,9 +8,8 @@ import (
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/utils"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/internal/path"
)
// The absolute path to the root of the cgroup hierarchies.
@ -21,13 +20,13 @@ var (
const defaultCgroupRoot = "/sys/fs/cgroup"
func initPaths(cg *configs.Cgroup) (map[string]string, error) {
func initPaths(cg *cgroups.Cgroup) (map[string]string, error) {
root, err := rootPath()
if err != nil {
return nil, err
}
inner, err := innerPath(cg)
inner, err := path.Inner(cg)
if err != nil {
return nil, err
}
@ -136,22 +135,6 @@ func rootPath() (string, error) {
return cgroupRoot, nil
}
func innerPath(c *configs.Cgroup) (string, error) {
if (c.Name != "" || c.Parent != "") && c.Path != "" {
return "", errors.New("cgroup: either Path or Name and Parent should be used")
}
// XXX: Do not remove CleanPath. Path safety is important! -- cyphar
innerPath := utils.CleanPath(c.Path)
if innerPath == "" {
cgParent := utils.CleanPath(c.Parent)
cgName := utils.CleanPath(c.Name)
innerPath = filepath.Join(cgParent, cgName)
}
return innerPath, nil
}
func subsysPath(root, inner, subsystem string) (string, error) {
// If the cgroup name/path is absolute do not look relative to the cgroup of the init process.
if filepath.IsAbs(inner) {

View File

@ -1,8 +1,7 @@
package fs
import (
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
type PerfEventGroup struct{}
@ -11,11 +10,11 @@ func (s *PerfEventGroup) Name() string {
return "perf_event"
}
func (s *PerfEventGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *PerfEventGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *PerfEventGroup) Set(_ string, _ *configs.Resources) error {
func (s *PerfEventGroup) Set(_ string, _ *cgroups.Resources) error {
return nil
}

View File

@ -4,9 +4,8 @@ import (
"math"
"strconv"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
type PidsGroup struct{}
@ -15,11 +14,11 @@ func (s *PidsGroup) Name() string {
return "pids"
}
func (s *PidsGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *PidsGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *PidsGroup) Set(path string, r *configs.Resources) error {
func (s *PidsGroup) Set(path string, r *cgroups.Resources) error {
if r.PidsLimit != 0 {
// "max" is the fallback value.
limit := "max"

View File

@ -1,9 +1,8 @@
package fs
import (
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
type RdmaGroup struct{}
@ -12,11 +11,11 @@ func (s *RdmaGroup) Name() string {
return "rdma"
}
func (s *RdmaGroup) Apply(path string, _ *configs.Resources, pid int) error {
func (s *RdmaGroup) Apply(path string, _ *cgroups.Resources, pid int) error {
return apply(path, pid)
}
func (s *RdmaGroup) Set(path string, r *configs.Resources) error {
func (s *RdmaGroup) Set(path string, r *cgroups.Resources) error {
return fscommon.RdmaSet(path, r)
}

View File

@ -8,17 +8,16 @@ import (
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
func isCpuSet(r *configs.Resources) bool {
func isCPUSet(r *cgroups.Resources) bool {
return r.CpuWeight != 0 || r.CpuQuota != 0 || r.CpuPeriod != 0 || r.CPUIdle != nil || r.CpuBurst != nil
}
func setCpu(dirPath string, r *configs.Resources) error {
if !isCpuSet(r) {
func setCPU(dirPath string, r *cgroups.Resources) error {
if !isCPUSet(r) {
return nil
}

View File

@ -1,15 +1,14 @@
package fs2
import (
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
func isCpusetSet(r *configs.Resources) bool {
func isCpusetSet(r *cgroups.Resources) bool {
return r.CpusetCpus != "" || r.CpusetMems != ""
}
func setCpuset(dirPath string, r *configs.Resources) error {
func setCpuset(dirPath string, r *cgroups.Resources) error {
if !isCpusetSet(r) {
return nil
}

View File

@ -6,8 +6,7 @@ import (
"path/filepath"
"strings"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
func supportedControllers() (string, error) {
@ -18,7 +17,7 @@ func supportedControllers() (string, error) {
// based on (1) controllers available and (2) resources that are being set.
// We don't check "pseudo" controllers such as
// "freezer" and "devices".
func needAnyControllers(r *configs.Resources) (bool, error) {
func needAnyControllers(r *cgroups.Resources) (bool, error) {
if r == nil {
return false, nil
}
@ -48,7 +47,7 @@ func needAnyControllers(r *configs.Resources) (bool, error) {
if isIoSet(r) && have("io") {
return true, nil
}
if isCpuSet(r) && have("cpu") {
if isCPUSet(r) && have("cpu") {
return true, nil
}
if isCpusetSet(r) && have("cpuset") {
@ -64,12 +63,12 @@ func needAnyControllers(r *configs.Resources) (bool, error) {
// containsDomainController returns whether the current config contains domain controller or not.
// Refer to: http://man7.org/linux/man-pages/man7/cgroups.7.html
// As at Linux 4.19, the following controllers are threaded: cpu, perf_event, and pids.
func containsDomainController(r *configs.Resources) bool {
return isMemorySet(r) || isIoSet(r) || isCpuSet(r) || isHugeTlbSet(r)
func containsDomainController(r *cgroups.Resources) bool {
return isMemorySet(r) || isIoSet(r) || isCPUSet(r) || isHugeTlbSet(r)
}
// CreateCgroupPath creates cgroupv2 path, enabling all the supported controllers.
func CreateCgroupPath(path string, c *configs.Cgroup) (Err error) {
func CreateCgroupPath(path string, c *cgroups.Cgroup) (Err error) {
if !strings.HasPrefix(path, UnifiedMountpoint) {
return fmt.Errorf("invalid cgroup path %s", path)
}

View File

@ -19,40 +19,25 @@ package fs2
import (
"bufio"
"errors"
"fmt"
"io"
"os"
"path/filepath"
"strings"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/utils"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/internal/path"
)
const UnifiedMountpoint = "/sys/fs/cgroup"
func defaultDirPath(c *configs.Cgroup) (string, error) {
if (c.Name != "" || c.Parent != "") && c.Path != "" {
return "", fmt.Errorf("cgroup: either Path or Name and Parent should be used, got %+v", c)
func defaultDirPath(c *cgroups.Cgroup) (string, error) {
innerPath, err := path.Inner(c)
if err != nil {
return "", err
}
return _defaultDirPath(UnifiedMountpoint, c.Path, c.Parent, c.Name)
}
func _defaultDirPath(root, cgPath, cgParent, cgName string) (string, error) {
if (cgName != "" || cgParent != "") && cgPath != "" {
return "", errors.New("cgroup: either Path or Name and Parent should be used")
}
// XXX: Do not remove CleanPath. Path safety is important! -- cyphar
innerPath := utils.CleanPath(cgPath)
if innerPath == "" {
cgParent := utils.CleanPath(cgParent)
cgName := utils.CleanPath(cgName)
innerPath = filepath.Join(cgParent, cgName)
}
if filepath.IsAbs(innerPath) {
return filepath.Join(root, innerPath), nil
return filepath.Join(UnifiedMountpoint, innerPath), nil
}
// we don't need to use /proc/thread-self here because runc always runs
@ -67,7 +52,7 @@ func _defaultDirPath(root, cgPath, cgParent, cgName string) (string, error) {
// A parent cgroup (with no tasks in it) is what we need.
ownCgroup = filepath.Dir(ownCgroup)
return filepath.Join(root, ownCgroup, innerPath), nil
return filepath.Join(UnifiedMountpoint, ownCgroup, innerPath), nil
}
// parseCgroupFile parses /proc/PID/cgroup file and return string
@ -83,16 +68,9 @@ func parseCgroupFile(path string) (string, error) {
func parseCgroupFromReader(r io.Reader) (string, error) {
s := bufio.NewScanner(r)
for s.Scan() {
var (
text = s.Text()
parts = strings.SplitN(text, ":", 3)
)
if len(parts) < 3 {
return "", fmt.Errorf("invalid cgroup entry: %q", text)
}
// text is like "0::/user.slice/user-1001.slice/session-1.scope"
if parts[0] == "0" && parts[1] == "" {
return parts[2], nil
// "0::/user.slice/user-1001.slice/session-1.scope"
if path, ok := strings.CutPrefix(s.Text(), "0::"); ok {
return path, nil
}
}
if err := s.Err(); err != nil {

View File

@ -10,18 +10,17 @@ import (
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
func setFreezer(dirPath string, state configs.FreezerState) error {
func setFreezer(dirPath string, state cgroups.FreezerState) error {
var stateStr string
switch state {
case configs.Undefined:
case cgroups.Undefined:
return nil
case configs.Frozen:
case cgroups.Frozen:
stateStr = "1"
case configs.Thawed:
case cgroups.Thawed:
stateStr = "0"
default:
return fmt.Errorf("invalid freezer state %q requested", state)
@ -32,7 +31,7 @@ func setFreezer(dirPath string, state configs.FreezerState) error {
// We can ignore this request as long as the user didn't ask us to
// freeze the container (since without the freezer cgroup, that's a
// no-op).
if state != configs.Frozen {
if state != cgroups.Frozen {
return nil
}
return fmt.Errorf("freezer not supported: %w", err)
@ -51,7 +50,7 @@ func setFreezer(dirPath string, state configs.FreezerState) error {
return nil
}
func getFreezer(dirPath string) (configs.FreezerState, error) {
func getFreezer(dirPath string) (cgroups.FreezerState, error) {
fd, err := cgroups.OpenFile(dirPath, "cgroup.freeze", unix.O_RDONLY)
if err != nil {
// If the kernel is too old, then we just treat the freezer as being in
@ -59,36 +58,36 @@ func getFreezer(dirPath string) (configs.FreezerState, error) {
if os.IsNotExist(err) || errors.Is(err, unix.ENODEV) {
err = nil
}
return configs.Undefined, err
return cgroups.Undefined, err
}
defer fd.Close()
return readFreezer(dirPath, fd)
}
func readFreezer(dirPath string, fd *os.File) (configs.FreezerState, error) {
func readFreezer(dirPath string, fd *os.File) (cgroups.FreezerState, error) {
if _, err := fd.Seek(0, 0); err != nil {
return configs.Undefined, err
return cgroups.Undefined, err
}
state := make([]byte, 2)
if _, err := fd.Read(state); err != nil {
return configs.Undefined, err
return cgroups.Undefined, err
}
switch string(state) {
case "0\n":
return configs.Thawed, nil
return cgroups.Thawed, nil
case "1\n":
return waitFrozen(dirPath)
default:
return configs.Undefined, fmt.Errorf(`unknown "cgroup.freeze" state: %q`, state)
return cgroups.Undefined, fmt.Errorf(`unknown "cgroup.freeze" state: %q`, state)
}
}
// waitFrozen polls cgroup.events until it sees "frozen 1" in it.
func waitFrozen(dirPath string) (configs.FreezerState, error) {
func waitFrozen(dirPath string) (cgroups.FreezerState, error) {
fd, err := cgroups.OpenFile(dirPath, "cgroup.events", unix.O_RDONLY)
if err != nil {
return configs.Undefined, err
return cgroups.Undefined, err
}
defer fd.Close()
@ -103,13 +102,11 @@ func waitFrozen(dirPath string) (configs.FreezerState, error) {
scanner := bufio.NewScanner(fd)
for i := 0; scanner.Scan(); {
if i == maxIter {
return configs.Undefined, fmt.Errorf("timeout of %s reached waiting for the cgroup to freeze", waitTime*maxIter)
return cgroups.Undefined, fmt.Errorf("timeout of %s reached waiting for the cgroup to freeze", waitTime*maxIter)
}
line := scanner.Text()
val := strings.TrimPrefix(line, "frozen ")
if val != line { // got prefix
if val, ok := strings.CutPrefix(scanner.Text(), "frozen "); ok {
if val[0] == '1' {
return configs.Frozen, nil
return cgroups.Frozen, nil
}
i++
@ -117,11 +114,11 @@ func waitFrozen(dirPath string) (configs.FreezerState, error) {
time.Sleep(waitTime)
_, err := fd.Seek(0, 0)
if err != nil {
return configs.Undefined, err
return cgroups.Undefined, err
}
}
}
// Should only reach here either on read error,
// or if the file does not contain "frozen " line.
return configs.Undefined, scanner.Err()
return cgroups.Undefined, scanner.Err()
}

View File

@ -6,15 +6,14 @@ import (
"os"
"strings"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
type parseError = fscommon.ParseError
type Manager struct {
config *configs.Cgroup
config *cgroups.Cgroup
// dirPath is like "/sys/fs/cgroup/user.slice/user-1001.slice/session-1.scope"
dirPath string
// controllers is content of "cgroup.controllers" file.
@ -25,7 +24,7 @@ type Manager struct {
// NewManager creates a manager for cgroup v2 unified hierarchy.
// dirPath is like "/sys/fs/cgroup/user.slice/user-1001.slice/session-1.scope".
// If dirPath is empty, it is automatically set using config.
func NewManager(config *configs.Cgroup, dirPath string) (*Manager, error) {
func NewManager(config *cgroups.Cgroup, dirPath string) (*Manager, error) {
if dirPath == "" {
var err error
dirPath, err = defaultDirPath(config)
@ -143,7 +142,7 @@ func (m *Manager) GetStats() (*cgroups.Stats, error) {
return st, nil
}
func (m *Manager) Freeze(state configs.FreezerState) error {
func (m *Manager) Freeze(state cgroups.FreezerState) error {
if m.config.Resources == nil {
return errors.New("cannot toggle freezer: cgroups not configured for container")
}
@ -162,7 +161,7 @@ func (m *Manager) Path(_ string) string {
return m.dirPath
}
func (m *Manager) Set(r *configs.Resources) error {
func (m *Manager) Set(r *cgroups.Resources) error {
if r == nil {
return nil
}
@ -182,7 +181,7 @@ func (m *Manager) Set(r *configs.Resources) error {
return err
}
// cpu (since kernel 4.15)
if err := setCpu(m.dirPath, r); err != nil {
if err := setCPU(m.dirPath, r); err != nil {
return err
}
// devices (since kernel 4.15, pseudo-controller)
@ -218,7 +217,7 @@ func (m *Manager) Set(r *configs.Resources) error {
return nil
}
func setDevices(dirPath string, r *configs.Resources) error {
func setDevices(dirPath string, r *cgroups.Resources) error {
if cgroups.DevicesSetV2 == nil {
if len(r.Devices) > 0 {
return cgroups.ErrDevicesUnsupported
@ -238,11 +237,10 @@ func (m *Manager) setUnified(res map[string]string) error {
if errors.Is(err, os.ErrPermission) || errors.Is(err, os.ErrNotExist) {
// Check if a controller is available,
// to give more specific error if not.
sk := strings.SplitN(k, ".", 2)
if len(sk) != 2 {
c, _, ok := strings.Cut(k, ".")
if !ok {
return fmt.Errorf("unified resource %q must be in the form CONTROLLER.PARAMETER", k)
}
c := sk[0]
if _, ok := m.controllers[c]; !ok && c != "cgroup" {
return fmt.Errorf("unified resource %q can't be set: controller %q not available", k, c)
}
@ -260,11 +258,11 @@ func (m *Manager) GetPaths() map[string]string {
return paths
}
func (m *Manager) GetCgroups() (*configs.Cgroup, error) {
func (m *Manager) GetCgroups() (*cgroups.Cgroup, error) {
return m.config, nil
}
func (m *Manager) GetFreezerState() (configs.FreezerState, error) {
func (m *Manager) GetFreezerState() (cgroups.FreezerState, error) {
return getFreezer(m.dirPath)
}
@ -285,7 +283,7 @@ func (m *Manager) OOMKillCount() (uint64, error) {
return c, err
}
func CheckMemoryUsage(dirPath string, r *configs.Resources) error {
func CheckMemoryUsage(dirPath string, r *cgroups.Resources) error {
if !r.MemoryCheckBeforeUpdate {
return nil
}

View File

@ -5,16 +5,15 @@ import (
"os"
"strconv"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
func isHugeTlbSet(r *configs.Resources) bool {
func isHugeTlbSet(r *cgroups.Resources) bool {
return len(r.HugetlbLimit) > 0
}
func setHugeTlb(dirPath string, r *configs.Resources) error {
func setHugeTlb(dirPath string, r *cgroups.Resources) error {
if !isHugeTlbSet(r) {
return nil
}

View File

@ -10,11 +10,10 @@ import (
"github.com/sirupsen/logrus"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
func isIoSet(r *configs.Resources) bool {
func isIoSet(r *cgroups.Resources) bool {
return r.BlkioWeight != 0 ||
len(r.BlkioWeightDevice) > 0 ||
len(r.BlkioThrottleReadBpsDevice) > 0 ||
@ -37,7 +36,7 @@ func bfqDeviceWeightSupported(bfq *os.File) bool {
return err != nil
}
func setIo(dirPath string, r *configs.Resources) error {
func setIo(dirPath string, r *cgroups.Resources) error {
if !isIoSet(r) {
return nil
}

View File

@ -10,9 +10,8 @@ import (
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
// numToStr converts an int64 value to a string for writing to a
@ -32,11 +31,11 @@ func numToStr(value int64) (ret string) {
return ret
}
func isMemorySet(r *configs.Resources) bool {
func isMemorySet(r *cgroups.Resources) bool {
return r.MemoryReservation != 0 || r.Memory != 0 || r.MemorySwap != 0
}
func setMemory(dirPath string, r *configs.Resources) error {
func setMemory(dirPath string, r *cgroups.Resources) error {
if !isMemorySet(r) {
return nil
}

View File

@ -5,8 +5,8 @@ import (
"os"
"strings"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
func statMisc(dirPath string, stats *cgroups.Stats) error {

View File

@ -8,16 +8,15 @@ import (
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fscommon"
)
func isPidsSet(r *configs.Resources) bool {
func isPidsSet(r *cgroups.Resources) bool {
return r.PidsLimit != 0
}
func setPids(dirPath string, r *configs.Resources) error {
func setPids(dirPath string, r *cgroups.Resources) error {
if !isPidsSet(r) {
return nil
}

View File

@ -10,7 +10,7 @@ import (
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/cgroups"
)
func statPSI(dirPath string, file string) (*cgroups.PSIStats, error) {
@ -58,12 +58,12 @@ func statPSI(dirPath string, file string) (*cgroups.PSIStats, error) {
func parsePSIData(psi []string) (cgroups.PSIData, error) {
data := cgroups.PSIData{}
for _, f := range psi {
kv := strings.SplitN(f, "=", 2)
if len(kv) != 2 {
key, val, ok := strings.Cut(f, "=")
if !ok {
return data, fmt.Errorf("invalid psi data: %q", f)
}
var pv *float64
switch kv[0] {
switch key {
case "avg10":
pv = &data.Avg10
case "avg60":
@ -71,16 +71,16 @@ func parsePSIData(psi []string) (cgroups.PSIData, error) {
case "avg300":
pv = &data.Avg300
case "total":
v, err := strconv.ParseUint(kv[1], 10, 64)
v, err := strconv.ParseUint(val, 10, 64)
if err != nil {
return data, fmt.Errorf("invalid %s PSI value: %w", kv[0], err)
return data, fmt.Errorf("invalid %s PSI value: %w", key, err)
}
data.Total = v
}
if pv != nil {
v, err := strconv.ParseFloat(kv[1], 64)
v, err := strconv.ParseFloat(val, 64)
if err != nil {
return data, fmt.Errorf("invalid %s PSI value: %w", kv[0], err)
return data, fmt.Errorf("invalid %s PSI value: %w", key, err)
}
*pv = v
}

View File

@ -8,23 +8,21 @@ import (
"strconv"
"strings"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"golang.org/x/sys/unix"
"github.com/opencontainers/cgroups"
)
// parseRdmaKV parses raw string to RdmaEntry.
func parseRdmaKV(raw string, entry *cgroups.RdmaEntry) error {
var value uint32
parts := strings.SplitN(raw, "=", 3)
k, v, ok := strings.Cut(raw, "=")
if len(parts) != 2 {
if !ok {
return errors.New("Unable to parse RDMA entry")
}
k, v := parts[0], parts[1]
if v == "max" {
value = math.MaxUint32
} else {
@ -34,9 +32,10 @@ func parseRdmaKV(raw string, entry *cgroups.RdmaEntry) error {
}
value = uint32(val64)
}
if k == "hca_handle" {
switch k {
case "hca_handle":
entry.HcaHandles = value
} else if k == "hca_object" {
case "hca_object":
entry.HcaObjects = value
}
@ -99,7 +98,7 @@ func RdmaGetStats(path string, stats *cgroups.Stats) error {
return nil
}
func createCmdString(device string, limits configs.LinuxRdma) string {
func createCmdString(device string, limits cgroups.LinuxRdma) string {
cmdString := device
if limits.HcaHandles != nil {
cmdString += " hca_handle=" + strconv.FormatUint(uint64(*limits.HcaHandles), 10)
@ -111,7 +110,7 @@ func createCmdString(device string, limits configs.LinuxRdma) string {
}
// RdmaSet sets RDMA resources.
func RdmaSet(path string, r *configs.Resources) error {
func RdmaSet(path string, r *cgroups.Resources) error {
for device, limits := range r.Rdma {
if err := cgroups.WriteFile(path, "rdma.max", createCmdString(device, limits)); err != nil {
return err

View File

@ -8,7 +8,7 @@ import (
"strconv"
"strings"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/cgroups"
)
var (
@ -54,38 +54,39 @@ func ParseUint(s string, base, bitSize int) (uint64, error) {
return value, nil
}
// ParseKeyValue parses a space-separated "name value" kind of cgroup
// ParseKeyValue parses a space-separated "key value" kind of cgroup
// parameter and returns its key as a string, and its value as uint64
// (ParseUint is used to convert the value). For example,
// (using [ParseUint] to convert the value). For example,
// "io_service_bytes 1234" will be returned as "io_service_bytes", 1234.
func ParseKeyValue(t string) (string, uint64, error) {
parts := strings.SplitN(t, " ", 3)
if len(parts) != 2 {
return "", 0, fmt.Errorf("line %q is not in key value format", t)
key, val, ok := strings.Cut(t, " ")
if !ok || key == "" || val == "" {
return "", 0, fmt.Errorf(`line %q is not in "key value" format`, t)
}
value, err := ParseUint(parts[1], 10, 64)
value, err := ParseUint(val, 10, 64)
if err != nil {
return "", 0, err
}
return parts[0], value, nil
return key, value, nil
}
// GetValueByKey reads a key-value pairs from the specified cgroup file,
// and returns a value of the specified key. ParseUint is used for value
// conversion.
// GetValueByKey reads space-separated "key value" pairs from the specified
// cgroup file, looking for a specified key, and returns its value as uint64,
// using [ParseUint] for conversion. If the value is not found, 0 is returned.
func GetValueByKey(path, file, key string) (uint64, error) {
content, err := cgroups.ReadFile(path, file)
if err != nil {
return 0, err
}
key += " "
lines := strings.Split(content, "\n")
for _, line := range lines {
arr := strings.Split(line, " ")
if len(arr) == 2 && arr[0] == key {
val, err := ParseUint(arr[1], 10, 64)
v, ok := strings.CutPrefix(line, key)
if ok {
val, err := ParseUint(v, 10, 64)
if err != nil {
err = &ParseError{Path: path, File: file, Err: err}
}
@ -103,7 +104,6 @@ func GetCgroupParamUint(path, file string) (uint64, error) {
if err != nil {
return 0, err
}
contents = strings.TrimSpace(contents)
if contents == "max" {
return math.MaxUint64, nil
}
@ -118,11 +118,10 @@ func GetCgroupParamUint(path, file string) (uint64, error) {
// GetCgroupParamInt reads a single int64 value from specified cgroup file.
// If the value read is "max", the math.MaxInt64 is returned.
func GetCgroupParamInt(path, file string) (int64, error) {
contents, err := cgroups.ReadFile(path, file)
contents, err := GetCgroupParamString(path, file)
if err != nil {
return 0, err
}
contents = strings.TrimSpace(contents)
if contents == "max" {
return math.MaxInt64, nil
}

View File

@ -0,0 +1,52 @@
package path
import (
"errors"
"os"
"path/filepath"
"github.com/opencontainers/cgroups"
)
// Inner returns a path to cgroup relative to a cgroup mount point, based
// on cgroup configuration, or an error, if cgroup configuration is invalid.
// To be used only by fs cgroup managers (systemd has different path rules).
func Inner(c *cgroups.Cgroup) (string, error) {
if (c.Name != "" || c.Parent != "") && c.Path != "" {
return "", errors.New("cgroup: either Path or Name and Parent should be used")
}
// XXX: Do not remove cleanPath. Path safety is important! -- cyphar
innerPath := cleanPath(c.Path)
if innerPath == "" {
cgParent := cleanPath(c.Parent)
cgName := cleanPath(c.Name)
innerPath = filepath.Join(cgParent, cgName)
}
return innerPath, nil
}
// cleanPath is a copy of github.com/opencontainers/runc/libcontainer/utils.CleanPath.
func cleanPath(path string) string {
// Deal with empty strings nicely.
if path == "" {
return ""
}
// Ensure that all paths are cleaned (especially problematic ones like
// "/../../../../../" which can cause lots of issues).
if filepath.IsAbs(path) {
return filepath.Clean(path)
}
// If the path isn't absolute, we need to do more processing to fix paths
// such as "../../../../<etc>/some/path". We also shouldn't convert absolute
// paths to relative ones.
path = filepath.Clean(string(os.PathSeparator) + path)
// This can't fail, as (by definition) all paths are relative to root.
path, _ = filepath.Rel(string(os.PathSeparator), path)
return path
}

View File

@ -5,17 +5,16 @@ import (
"fmt"
"path/filepath"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fs"
"github.com/opencontainers/runc/libcontainer/cgroups/fs2"
"github.com/opencontainers/runc/libcontainer/cgroups/systemd"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fs"
"github.com/opencontainers/cgroups/fs2"
"github.com/opencontainers/cgroups/systemd"
)
// New returns the instance of a cgroup manager, which is chosen
// based on the local environment (whether cgroup v1 or v2 is used)
// and the config (whether config.Systemd is set or not).
func New(config *configs.Cgroup) (cgroups.Manager, error) {
func New(config *cgroups.Cgroup) (cgroups.Manager, error) {
return NewWithPaths(config, nil)
}
@ -27,7 +26,7 @@ func New(config *configs.Cgroup) (cgroups.Manager, error) {
//
// For cgroup v2, the only key allowed is "" (empty string), and the value
// is the unified cgroup path.
func NewWithPaths(config *configs.Cgroup, paths map[string]string) (cgroups.Manager, error) {
func NewWithPaths(config *cgroups.Cgroup, paths map[string]string) (cgroups.Manager, error) {
if config == nil {
return nil, errors.New("cgroups/manager.New: config must not be nil")
}

View File

@ -15,8 +15,7 @@ import (
dbus "github.com/godbus/dbus/v5"
"github.com/sirupsen/logrus"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
const (
@ -35,10 +34,10 @@ var (
// GenerateDeviceProps is a function to generate systemd device
// properties, used by Set methods. Unless
// [github.com/opencontainers/runc/libcontainer/cgroups/devices]
// [github.com/opencontainers/cgroups/devices]
// package is imported, it is set to nil, so cgroup managers can't
// configure devices.
GenerateDeviceProps func(r *configs.Resources, sdVer int) ([]systemdDbus.Property, error)
GenerateDeviceProps func(r *cgroups.Resources, sdVer int) ([]systemdDbus.Property, error)
)
// NOTE: This function comes from package github.com/coreos/go-systemd/util
@ -97,7 +96,7 @@ func newProp(name string, units interface{}) systemdDbus.Property {
}
}
func getUnitName(c *configs.Cgroup) string {
func getUnitName(c *cgroups.Cgroup) string {
// by default, we create a scope unless the user explicitly asks for a slice.
if !strings.HasSuffix(c.Name, ".slice") {
return c.ScopePrefix + "-" + c.Name + ".scope"
@ -351,7 +350,7 @@ func addCpuset(cm *dbusConnManager, props *[]systemdDbus.Property, cpus, mems st
// generateDeviceProperties takes the configured device rules and generates a
// corresponding set of systemd properties to configure the devices correctly.
func generateDeviceProperties(r *configs.Resources, cm *dbusConnManager) ([]systemdDbus.Property, error) {
func generateDeviceProperties(r *cgroups.Resources, cm *dbusConnManager) ([]systemdDbus.Property, error) {
if GenerateDeviceProps == nil {
if len(r.Devices) > 0 {
return nil, cgroups.ErrDevicesUnsupported

View File

@ -5,7 +5,7 @@ import (
dbus "github.com/godbus/dbus/v5"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
)
// freezeBeforeSet answers whether there is a need to freeze the cgroup before
@ -17,7 +17,7 @@ import (
// (unlike our fs driver, they will happily write deny-all rules to running
// containers). So we have to freeze the container to avoid the container get
// an occasional "permission denied" error.
func (m *LegacyManager) freezeBeforeSet(unitName string, r *configs.Resources) (needsFreeze, needsThaw bool, err error) {
func (m *LegacyManager) freezeBeforeSet(unitName string, r *cgroups.Resources) (needsFreeze, needsThaw bool, err error) {
// Special case for SkipDevices, as used by Kubernetes to create pod
// cgroups with allow-all device policy).
if r.SkipDevices {
@ -60,13 +60,13 @@ func (m *LegacyManager) freezeBeforeSet(unitName string, r *configs.Resources) (
if err != nil {
return
}
if freezerState == configs.Frozen {
if freezerState == cgroups.Frozen {
// Already frozen, and should stay frozen.
needsFreeze = false
needsThaw = false
}
if r.Freezer == configs.Frozen {
if r.Freezer == cgroups.Frozen {
// Will be frozen anyway -- no need to thaw.
needsThaw = false
}

View File

@ -61,8 +61,7 @@ func DetectUID() (int, error) {
scanner := bufio.NewScanner(bytes.NewReader(b))
for scanner.Scan() {
s := strings.TrimSpace(scanner.Text())
if strings.HasPrefix(s, "OwnerUID=") {
uidStr := strings.TrimPrefix(s, "OwnerUID=")
if uidStr, ok := strings.CutPrefix(s, "OwnerUID="); ok {
i, err := strconv.Atoi(uidStr)
if err != nil {
return -1, fmt.Errorf("could not detect the OwnerUID: %w", err)

View File

@ -10,19 +10,18 @@ import (
systemdDbus "github.com/coreos/go-systemd/v22/dbus"
"github.com/sirupsen/logrus"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fs"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fs"
)
type LegacyManager struct {
mu sync.Mutex
cgroups *configs.Cgroup
cgroups *cgroups.Cgroup
paths map[string]string
dbus *dbusConnManager
}
func NewLegacyManager(cg *configs.Cgroup, paths map[string]string) (*LegacyManager, error) {
func NewLegacyManager(cg *cgroups.Cgroup, paths map[string]string) (*LegacyManager, error) {
if cg.Rootless {
return nil, errors.New("cannot use rootless systemd cgroups manager on cgroup v1")
}
@ -49,7 +48,7 @@ type subsystem interface {
// GetStats returns the stats, as 'stats', corresponding to the cgroup under 'path'.
GetStats(path string, stats *cgroups.Stats) error
// Set sets cgroup resource limits.
Set(path string, r *configs.Resources) error
Set(path string, r *cgroups.Resources) error
}
var errSubsystemDoesNotExist = errors.New("cgroup: subsystem does not exist")
@ -72,7 +71,7 @@ var legacySubsystems = []subsystem{
&fs.NameGroup{GroupName: "misc"},
}
func genV1ResourcesProperties(r *configs.Resources, cm *dbusConnManager) ([]systemdDbus.Property, error) {
func genV1ResourcesProperties(r *cgroups.Resources, cm *dbusConnManager) ([]systemdDbus.Property, error) {
var properties []systemdDbus.Property
deviceProperties, err := generateDeviceProperties(r, cm)
@ -112,7 +111,7 @@ func genV1ResourcesProperties(r *configs.Resources, cm *dbusConnManager) ([]syst
}
// initPaths figures out and returns paths to cgroups.
func initPaths(c *configs.Cgroup) (map[string]string, error) {
func initPaths(c *cgroups.Cgroup) (map[string]string, error) {
slice := "system.slice"
if c.Parent != "" {
var err error
@ -275,7 +274,7 @@ func getSubsystemPath(slice, unit, subsystem string) (string, error) {
return filepath.Join(mountpoint, slice, unit), nil
}
func (m *LegacyManager) Freeze(state configs.FreezerState) error {
func (m *LegacyManager) Freeze(state cgroups.FreezerState) error {
err := m.doFreeze(state)
if err == nil {
m.cgroups.Resources.Freezer = state
@ -285,13 +284,13 @@ func (m *LegacyManager) Freeze(state configs.FreezerState) error {
// doFreeze is the same as Freeze but without
// changing the m.cgroups.Resources.Frozen field.
func (m *LegacyManager) doFreeze(state configs.FreezerState) error {
func (m *LegacyManager) doFreeze(state cgroups.FreezerState) error {
path, ok := m.paths["freezer"]
if !ok {
return errSubsystemDoesNotExist
}
freezer := &fs.FreezerGroup{}
resources := &configs.Resources{Freezer: state}
resources := &cgroups.Resources{Freezer: state}
return freezer.Set(path, resources)
}
@ -328,7 +327,7 @@ func (m *LegacyManager) GetStats() (*cgroups.Stats, error) {
return stats, nil
}
func (m *LegacyManager) Set(r *configs.Resources) error {
func (m *LegacyManager) Set(r *cgroups.Resources) error {
if r == nil {
return nil
}
@ -347,13 +346,13 @@ func (m *LegacyManager) Set(r *configs.Resources) error {
}
if needsFreeze {
if err := m.doFreeze(configs.Frozen); err != nil {
if err := m.doFreeze(cgroups.Frozen); err != nil {
// If freezer cgroup isn't supported, we just warn about it.
logrus.Infof("freeze container before SetUnitProperties failed: %v", err)
// skip update the cgroup while frozen failed. #3803
if !errors.Is(err, errSubsystemDoesNotExist) {
if needsThaw {
if thawErr := m.doFreeze(configs.Thawed); thawErr != nil {
if thawErr := m.doFreeze(cgroups.Thawed); thawErr != nil {
logrus.Infof("thaw container after doFreeze failed: %v", thawErr)
}
}
@ -363,7 +362,7 @@ func (m *LegacyManager) Set(r *configs.Resources) error {
}
setErr := setUnitProperties(m.dbus, unitName, properties...)
if needsThaw {
if err := m.doFreeze(configs.Thawed); err != nil {
if err := m.doFreeze(cgroups.Thawed); err != nil {
logrus.Infof("thaw container after SetUnitProperties failed: %v", err)
}
}
@ -391,14 +390,14 @@ func (m *LegacyManager) GetPaths() map[string]string {
return m.paths
}
func (m *LegacyManager) GetCgroups() (*configs.Cgroup, error) {
func (m *LegacyManager) GetCgroups() (*cgroups.Cgroup, error) {
return m.cgroups, nil
}
func (m *LegacyManager) GetFreezerState() (configs.FreezerState, error) {
func (m *LegacyManager) GetFreezerState() (cgroups.FreezerState, error) {
path, ok := m.paths["freezer"]
if !ok {
return configs.Undefined, nil
return cgroups.Undefined, nil
}
freezer := &fs.FreezerGroup{}
return freezer.GetState(path)

View File

@ -15,9 +15,8 @@ import (
securejoin "github.com/cyphar/filepath-securejoin"
"github.com/sirupsen/logrus"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fs2"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/cgroups"
"github.com/opencontainers/cgroups/fs2"
)
const (
@ -26,14 +25,14 @@ const (
type UnifiedManager struct {
mu sync.Mutex
cgroups *configs.Cgroup
cgroups *cgroups.Cgroup
// path is like "/sys/fs/cgroup/user.slice/user-1001.slice/session-1.scope"
path string
dbus *dbusConnManager
fsMgr cgroups.Manager
}
func NewUnifiedManager(config *configs.Cgroup, path string) (*UnifiedManager, error) {
func NewUnifiedManager(config *cgroups.Cgroup, path string) (*UnifiedManager, error) {
m := &UnifiedManager{
cgroups: config,
path: path,
@ -199,7 +198,7 @@ func unifiedResToSystemdProps(cm *dbusConnManager, res map[string]string) (props
return props, nil
}
func genV2ResourcesProperties(dirPath string, r *configs.Resources, cm *dbusConnManager) ([]systemdDbus.Property, error) {
func genV2ResourcesProperties(dirPath string, r *cgroups.Resources, cm *dbusConnManager) ([]systemdDbus.Property, error) {
// We need this check before setting systemd properties, otherwise
// the container is OOM-killed and the systemd unit is removed
// before we get to fsMgr.Set().
@ -461,7 +460,7 @@ func (m *UnifiedManager) initPath() error {
return nil
}
func (m *UnifiedManager) Freeze(state configs.FreezerState) error {
func (m *UnifiedManager) Freeze(state cgroups.FreezerState) error {
return m.fsMgr.Freeze(state)
}
@ -477,7 +476,7 @@ func (m *UnifiedManager) GetStats() (*cgroups.Stats, error) {
return m.fsMgr.GetStats()
}
func (m *UnifiedManager) Set(r *configs.Resources) error {
func (m *UnifiedManager) Set(r *cgroups.Resources) error {
if r == nil {
return nil
}
@ -499,11 +498,11 @@ func (m *UnifiedManager) GetPaths() map[string]string {
return paths
}
func (m *UnifiedManager) GetCgroups() (*configs.Cgroup, error) {
func (m *UnifiedManager) GetCgroups() (*cgroups.Cgroup, error) {
return m.cgroups, nil
}
func (m *UnifiedManager) GetFreezerState() (configs.FreezerState, error) {
func (m *UnifiedManager) GetFreezerState() (cgroups.FreezerState, error) {
return m.fsMgr.GetFreezerState()
}

View File

@ -251,27 +251,39 @@ again:
// RemovePath aims to remove cgroup path. It does so recursively,
// by removing any subdirectories (sub-cgroups) first.
func RemovePath(path string) error {
// Try the fast path first.
// Try the fast path first; don't retry on EBUSY yet.
if err := rmdir(path, false); err == nil {
return nil
}
// There are many reasons why rmdir can fail, including:
// 1. cgroup have existing sub-cgroups;
// 2. cgroup (still) have some processes (that are about to vanish);
// 3. lack of permission (one example is read-only /sys/fs/cgroup mount,
// in which case rmdir returns EROFS even for for a non-existent path,
// see issue 4518).
//
// Using os.ReadDir here kills two birds with one stone: check if
// the directory exists (handling scenario 3 above), and use
// directory contents to remove sub-cgroups (handling scenario 1).
infos, err := os.ReadDir(path)
if err != nil && !os.IsNotExist(err) {
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
// Let's remove sub-cgroups, if any.
for _, info := range infos {
if info.IsDir() {
// We should remove subcgroup first.
if err = RemovePath(filepath.Join(path, info.Name())); err != nil {
break
return err
}
}
}
if err == nil {
err = rmdir(path, true)
}
return err
// Finally, try rmdir again, this time with retries on EBUSY,
// which may help with scenario 2 above.
return rmdir(path, true)
}
// RemovePaths iterates over the provided paths removing them.
@ -320,8 +332,8 @@ func getHugePageSizeFromFilenames(fileNames []string) ([]string, error) {
for _, file := range fileNames {
// example: hugepages-1048576kB
val := strings.TrimPrefix(file, "hugepages-")
if len(val) == len(file) {
val, ok := strings.CutPrefix(file, "hugepages-")
if !ok {
// Unexpected file name: no prefix found, ignore it.
continue
}

View File

@ -176,7 +176,7 @@
END OF TERMS AND CONDITIONS
Copyright 2014 Docker, Inc.
Copyright 2016 The Linux Foundation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

View File

@ -0,0 +1,62 @@
// Copyright 2016 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package v1
const (
// AnnotationCreated is the annotation key for the date and time on which the image was built (date-time string as defined by RFC 3339).
AnnotationCreated = "org.opencontainers.image.created"
// AnnotationAuthors is the annotation key for the contact details of the people or organization responsible for the image (freeform string).
AnnotationAuthors = "org.opencontainers.image.authors"
// AnnotationURL is the annotation key for the URL to find more information on the image.
AnnotationURL = "org.opencontainers.image.url"
// AnnotationDocumentation is the annotation key for the URL to get documentation on the image.
AnnotationDocumentation = "org.opencontainers.image.documentation"
// AnnotationSource is the annotation key for the URL to get source code for building the image.
AnnotationSource = "org.opencontainers.image.source"
// AnnotationVersion is the annotation key for the version of the packaged software.
// The version MAY match a label or tag in the source code repository.
// The version MAY be Semantic versioning-compatible.
AnnotationVersion = "org.opencontainers.image.version"
// AnnotationRevision is the annotation key for the source control revision identifier for the packaged software.
AnnotationRevision = "org.opencontainers.image.revision"
// AnnotationVendor is the annotation key for the name of the distributing entity, organization or individual.
AnnotationVendor = "org.opencontainers.image.vendor"
// AnnotationLicenses is the annotation key for the license(s) under which contained software is distributed as an SPDX License Expression.
AnnotationLicenses = "org.opencontainers.image.licenses"
// AnnotationRefName is the annotation key for the name of the reference for a target.
// SHOULD only be considered valid when on descriptors on `index.json` within image layout.
AnnotationRefName = "org.opencontainers.image.ref.name"
// AnnotationTitle is the annotation key for the human-readable title of the image.
AnnotationTitle = "org.opencontainers.image.title"
// AnnotationDescription is the annotation key for the human-readable description of the software packaged in the image.
AnnotationDescription = "org.opencontainers.image.description"
// AnnotationBaseImageDigest is the annotation key for the digest of the image's base image.
AnnotationBaseImageDigest = "org.opencontainers.image.base.digest"
// AnnotationBaseImageName is the annotation key for the image reference of the image's base image.
AnnotationBaseImageName = "org.opencontainers.image.base.name"
)

View File

@ -0,0 +1,111 @@
// Copyright 2016 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package v1
import (
"time"
digest "github.com/opencontainers/go-digest"
)
// ImageConfig defines the execution parameters which should be used as a base when running a container using an image.
type ImageConfig struct {
// User defines the username or UID which the process in the container should run as.
User string `json:"User,omitempty"`
// ExposedPorts a set of ports to expose from a container running this image.
ExposedPorts map[string]struct{} `json:"ExposedPorts,omitempty"`
// Env is a list of environment variables to be used in a container.
Env []string `json:"Env,omitempty"`
// Entrypoint defines a list of arguments to use as the command to execute when the container starts.
Entrypoint []string `json:"Entrypoint,omitempty"`
// Cmd defines the default arguments to the entrypoint of the container.
Cmd []string `json:"Cmd,omitempty"`
// Volumes is a set of directories describing where the process is likely write data specific to a container instance.
Volumes map[string]struct{} `json:"Volumes,omitempty"`
// WorkingDir sets the current working directory of the entrypoint process in the container.
WorkingDir string `json:"WorkingDir,omitempty"`
// Labels contains arbitrary metadata for the container.
Labels map[string]string `json:"Labels,omitempty"`
// StopSignal contains the system call signal that will be sent to the container to exit.
StopSignal string `json:"StopSignal,omitempty"`
// ArgsEscaped
//
// Deprecated: This field is present only for legacy compatibility with
// Docker and should not be used by new image builders. It is used by Docker
// for Windows images to indicate that the `Entrypoint` or `Cmd` or both,
// contains only a single element array, that is a pre-escaped, and combined
// into a single string `CommandLine`. If `true` the value in `Entrypoint` or
// `Cmd` should be used as-is to avoid double escaping.
// https://github.com/opencontainers/image-spec/pull/892
ArgsEscaped bool `json:"ArgsEscaped,omitempty"`
}
// RootFS describes a layer content addresses
type RootFS struct {
// Type is the type of the rootfs.
Type string `json:"type"`
// DiffIDs is an array of layer content hashes (DiffIDs), in order from bottom-most to top-most.
DiffIDs []digest.Digest `json:"diff_ids"`
}
// History describes the history of a layer.
type History struct {
// Created is the combined date and time at which the layer was created, formatted as defined by RFC 3339, section 5.6.
Created *time.Time `json:"created,omitempty"`
// CreatedBy is the command which created the layer.
CreatedBy string `json:"created_by,omitempty"`
// Author is the author of the build point.
Author string `json:"author,omitempty"`
// Comment is a custom message set when creating the layer.
Comment string `json:"comment,omitempty"`
// EmptyLayer is used to mark if the history item created a filesystem diff.
EmptyLayer bool `json:"empty_layer,omitempty"`
}
// Image is the JSON structure which describes some basic information about the image.
// This provides the `application/vnd.oci.image.config.v1+json` mediatype when marshalled to JSON.
type Image struct {
// Created is the combined date and time at which the image was created, formatted as defined by RFC 3339, section 5.6.
Created *time.Time `json:"created,omitempty"`
// Author defines the name and/or email address of the person or entity which created and is responsible for maintaining the image.
Author string `json:"author,omitempty"`
// Platform describes the platform which the image in the manifest runs on.
Platform
// Config defines the execution parameters which should be used as a base when running a container using the image.
Config ImageConfig `json:"config,omitempty"`
// RootFS references the layer content addresses used by the image.
RootFS RootFS `json:"rootfs"`
// History describes the history of each layer.
History []History `json:"history,omitempty"`
}

View File

@ -0,0 +1,80 @@
// Copyright 2016-2022 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package v1
import digest "github.com/opencontainers/go-digest"
// Descriptor describes the disposition of targeted content.
// This structure provides `application/vnd.oci.descriptor.v1+json` mediatype
// when marshalled to JSON.
type Descriptor struct {
// MediaType is the media type of the object this schema refers to.
MediaType string `json:"mediaType"`
// Digest is the digest of the targeted content.
Digest digest.Digest `json:"digest"`
// Size specifies the size in bytes of the blob.
Size int64 `json:"size"`
// URLs specifies a list of URLs from which this object MAY be downloaded
URLs []string `json:"urls,omitempty"`
// Annotations contains arbitrary metadata relating to the targeted content.
Annotations map[string]string `json:"annotations,omitempty"`
// Data is an embedding of the targeted content. This is encoded as a base64
// string when marshalled to JSON (automatically, by encoding/json). If
// present, Data can be used directly to avoid fetching the targeted content.
Data []byte `json:"data,omitempty"`
// Platform describes the platform which the image in the manifest runs on.
//
// This should only be used when referring to a manifest.
Platform *Platform `json:"platform,omitempty"`
// ArtifactType is the IANA media type of this artifact.
ArtifactType string `json:"artifactType,omitempty"`
}
// Platform describes the platform which the image in the manifest runs on.
type Platform struct {
// Architecture field specifies the CPU architecture, for example
// `amd64` or `ppc64le`.
Architecture string `json:"architecture"`
// OS specifies the operating system, for example `linux` or `windows`.
OS string `json:"os"`
// OSVersion is an optional field specifying the operating system
// version, for example on Windows `10.0.14393.1066`.
OSVersion string `json:"os.version,omitempty"`
// OSFeatures is an optional field specifying an array of strings,
// each listing a required OS feature (for example on Windows `win32k`).
OSFeatures []string `json:"os.features,omitempty"`
// Variant is an optional field specifying a variant of the CPU, for
// example `v7` to specify ARMv7 when architecture is `arm`.
Variant string `json:"variant,omitempty"`
}
// DescriptorEmptyJSON is the descriptor of a blob with content of `{}`.
var DescriptorEmptyJSON = Descriptor{
MediaType: MediaTypeEmptyJSON,
Digest: `sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a`,
Size: 2,
Data: []byte(`{}`),
}

View File

@ -0,0 +1,38 @@
// Copyright 2016 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package v1
import "github.com/opencontainers/image-spec/specs-go"
// Index references manifests for various platforms.
// This structure provides `application/vnd.oci.image.index.v1+json` mediatype when marshalled to JSON.
type Index struct {
specs.Versioned
// MediaType specifies the type of this document data structure e.g. `application/vnd.oci.image.index.v1+json`
MediaType string `json:"mediaType,omitempty"`
// ArtifactType specifies the IANA media type of artifact when the manifest is used for an artifact.
ArtifactType string `json:"artifactType,omitempty"`
// Manifests references platform specific manifests.
Manifests []Descriptor `json:"manifests"`
// Subject is an optional link from the image manifest to another manifest forming an association between the image manifest and the other manifest.
Subject *Descriptor `json:"subject,omitempty"`
// Annotations contains arbitrary metadata for the image index.
Annotations map[string]string `json:"annotations,omitempty"`
}

View File

@ -0,0 +1,32 @@
// Copyright 2016 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package v1
const (
// ImageLayoutFile is the file name containing ImageLayout in an OCI Image Layout
ImageLayoutFile = "oci-layout"
// ImageLayoutVersion is the version of ImageLayout
ImageLayoutVersion = "1.0.0"
// ImageIndexFile is the file name of the entry point for references and descriptors in an OCI Image Layout
ImageIndexFile = "index.json"
// ImageBlobsDir is the directory name containing content addressable blobs in an OCI Image Layout
ImageBlobsDir = "blobs"
)
// ImageLayout is the structure in the "oci-layout" file, found in the root
// of an OCI Image-layout directory.
type ImageLayout struct {
Version string `json:"imageLayoutVersion"`
}

View File

@ -0,0 +1,41 @@
// Copyright 2016-2022 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package v1
import "github.com/opencontainers/image-spec/specs-go"
// Manifest provides `application/vnd.oci.image.manifest.v1+json` mediatype structure when marshalled to JSON.
type Manifest struct {
specs.Versioned
// MediaType specifies the type of this document data structure e.g. `application/vnd.oci.image.manifest.v1+json`
MediaType string `json:"mediaType,omitempty"`
// ArtifactType specifies the IANA media type of artifact when the manifest is used for an artifact.
ArtifactType string `json:"artifactType,omitempty"`
// Config references a configuration object for a container, by digest.
// The referenced configuration object is a JSON blob that the runtime uses to set up the container.
Config Descriptor `json:"config"`
// Layers is an indexed list of layers referenced by the manifest.
Layers []Descriptor `json:"layers"`
// Subject is an optional link from the image manifest to another manifest forming an association between the image manifest and the other manifest.
Subject *Descriptor `json:"subject,omitempty"`
// Annotations contains arbitrary metadata for the image manifest.
Annotations map[string]string `json:"annotations,omitempty"`
}

View File

@ -0,0 +1,85 @@
// Copyright 2016 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package v1
const (
// MediaTypeDescriptor specifies the media type for a content descriptor.
MediaTypeDescriptor = "application/vnd.oci.descriptor.v1+json"
// MediaTypeLayoutHeader specifies the media type for the oci-layout.
MediaTypeLayoutHeader = "application/vnd.oci.layout.header.v1+json"
// MediaTypeImageIndex specifies the media type for an image index.
MediaTypeImageIndex = "application/vnd.oci.image.index.v1+json"
// MediaTypeImageManifest specifies the media type for an image manifest.
MediaTypeImageManifest = "application/vnd.oci.image.manifest.v1+json"
// MediaTypeImageConfig specifies the media type for the image configuration.
MediaTypeImageConfig = "application/vnd.oci.image.config.v1+json"
// MediaTypeEmptyJSON specifies the media type for an unused blob containing the value "{}".
MediaTypeEmptyJSON = "application/vnd.oci.empty.v1+json"
)
const (
// MediaTypeImageLayer is the media type used for layers referenced by the manifest.
MediaTypeImageLayer = "application/vnd.oci.image.layer.v1.tar"
// MediaTypeImageLayerGzip is the media type used for gzipped layers
// referenced by the manifest.
MediaTypeImageLayerGzip = "application/vnd.oci.image.layer.v1.tar+gzip"
// MediaTypeImageLayerZstd is the media type used for zstd compressed
// layers referenced by the manifest.
MediaTypeImageLayerZstd = "application/vnd.oci.image.layer.v1.tar+zstd"
)
// Non-distributable layer media-types.
//
// Deprecated: Non-distributable layers are deprecated, and not recommended
// for future use. Implementations SHOULD NOT produce new non-distributable
// layers.
// https://github.com/opencontainers/image-spec/pull/965
const (
// MediaTypeImageLayerNonDistributable is the media type for layers referenced by
// the manifest but with distribution restrictions.
//
// Deprecated: Non-distributable layers are deprecated, and not recommended
// for future use. Implementations SHOULD NOT produce new non-distributable
// layers.
// https://github.com/opencontainers/image-spec/pull/965
MediaTypeImageLayerNonDistributable = "application/vnd.oci.image.layer.nondistributable.v1.tar"
// MediaTypeImageLayerNonDistributableGzip is the media type for
// gzipped layers referenced by the manifest but with distribution
// restrictions.
//
// Deprecated: Non-distributable layers are deprecated, and not recommended
// for future use. Implementations SHOULD NOT produce new non-distributable
// layers.
// https://github.com/opencontainers/image-spec/pull/965
MediaTypeImageLayerNonDistributableGzip = "application/vnd.oci.image.layer.nondistributable.v1.tar+gzip"
// MediaTypeImageLayerNonDistributableZstd is the media type for zstd
// compressed layers referenced by the manifest but with distribution
// restrictions.
//
// Deprecated: Non-distributable layers are deprecated, and not recommended
// for future use. Implementations SHOULD NOT produce new non-distributable
// layers.
// https://github.com/opencontainers/image-spec/pull/965
MediaTypeImageLayerNonDistributableZstd = "application/vnd.oci.image.layer.nondistributable.v1.tar+zstd"
)

View File

@ -0,0 +1,32 @@
// Copyright 2016 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package specs
import "fmt"
const (
// VersionMajor is for an API incompatible changes
VersionMajor = 1
// VersionMinor is for functionality in a backwards-compatible manner
VersionMinor = 1
// VersionPatch is for backwards-compatible bug fixes
VersionPatch = 1
// VersionDev indicates development branch. Releases will be empty string.
VersionDev = ""
)
// Version is the specification version that the package types support.
var Version = fmt.Sprintf("%d.%d.%d%s", VersionMajor, VersionMinor, VersionPatch, VersionDev)

View File

@ -0,0 +1,23 @@
// Copyright 2016 The Linux Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package specs
// Versioned provides a struct with the manifest schemaVersion and mediaType.
// Incoming content with unknown schema version can be decoded against this
// struct to check the version.
type Versioned struct {
// SchemaVersion is the image manifest schema that this image follows
SchemaVersion int `json:"schemaVersion"`
}

View File

@ -1,17 +0,0 @@
runc
Copyright 2012-2015 Docker, Inc.
This product includes software developed at Docker, Inc. (http://www.docker.com).
The following is courtesy of our legal counsel:
Use and transfer of Docker may be subject to certain restrictions by the
United States and other governments.
It is your responsibility to ensure that your use and/or transfer does not
violate applicable laws.
For more information, please see http://www.bis.doc.gov
See also http://www.apache.org/dev/crypto.html and/or seek legal counsel.

View File

@ -1,508 +0,0 @@
package configs
import (
"bytes"
"encoding/json"
"fmt"
"os/exec"
"time"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/devices"
"github.com/opencontainers/runtime-spec/specs-go"
)
type Rlimit struct {
Type int `json:"type"`
Hard uint64 `json:"hard"`
Soft uint64 `json:"soft"`
}
// IDMap represents UID/GID Mappings for User Namespaces.
type IDMap struct {
ContainerID int64 `json:"container_id"`
HostID int64 `json:"host_id"`
Size int64 `json:"size"`
}
// Seccomp represents syscall restrictions
// By default, only the native architecture of the kernel is allowed to be used
// for syscalls. Additional architectures can be added by specifying them in
// Architectures.
type Seccomp struct {
DefaultAction Action `json:"default_action"`
Architectures []string `json:"architectures"`
Flags []specs.LinuxSeccompFlag `json:"flags"`
Syscalls []*Syscall `json:"syscalls"`
DefaultErrnoRet *uint `json:"default_errno_ret"`
ListenerPath string `json:"listener_path,omitempty"`
ListenerMetadata string `json:"listener_metadata,omitempty"`
}
// Action is taken upon rule match in Seccomp
type Action int
const (
Kill Action = iota + 1
Errno
Trap
Allow
Trace
Log
Notify
KillThread
KillProcess
)
// Operator is a comparison operator to be used when matching syscall arguments in Seccomp
type Operator int
const (
EqualTo Operator = iota + 1
NotEqualTo
GreaterThan
GreaterThanOrEqualTo
LessThan
LessThanOrEqualTo
MaskEqualTo
)
// Arg is a rule to match a specific syscall argument in Seccomp
type Arg struct {
Index uint `json:"index"`
Value uint64 `json:"value"`
ValueTwo uint64 `json:"value_two"`
Op Operator `json:"op"`
}
// Syscall is a rule to match a syscall in Seccomp
type Syscall struct {
Name string `json:"name"`
Action Action `json:"action"`
ErrnoRet *uint `json:"errnoRet"`
Args []*Arg `json:"args"`
}
// Config defines configuration options for executing a process inside a contained environment.
type Config struct {
// NoPivotRoot will use MS_MOVE and a chroot to jail the process into the container's rootfs
// This is a common option when the container is running in ramdisk
NoPivotRoot bool `json:"no_pivot_root"`
// ParentDeathSignal specifies the signal that is sent to the container's process in the case
// that the parent process dies.
ParentDeathSignal int `json:"parent_death_signal"`
// Path to a directory containing the container's root filesystem.
Rootfs string `json:"rootfs"`
// Umask is the umask to use inside of the container.
Umask *uint32 `json:"umask"`
// Readonlyfs will remount the container's rootfs as readonly where only externally mounted
// bind mounts are writtable.
Readonlyfs bool `json:"readonlyfs"`
// Specifies the mount propagation flags to be applied to /.
RootPropagation int `json:"rootPropagation"`
// Mounts specify additional source and destination paths that will be mounted inside the container's
// rootfs and mount namespace if specified
Mounts []*Mount `json:"mounts"`
// The device nodes that should be automatically created within the container upon container start. Note, make sure that the node is marked as allowed in the cgroup as well!
Devices []*devices.Device `json:"devices"`
MountLabel string `json:"mount_label"`
// Hostname optionally sets the container's hostname if provided
Hostname string `json:"hostname"`
// Domainname optionally sets the container's domainname if provided
Domainname string `json:"domainname"`
// Namespaces specifies the container's namespaces that it should setup when cloning the init process
// If a namespace is not provided that namespace is shared from the container's parent process
Namespaces Namespaces `json:"namespaces"`
// Capabilities specify the capabilities to keep when executing the process inside the container
// All capabilities not specified will be dropped from the processes capability mask
Capabilities *Capabilities `json:"capabilities"`
// Networks specifies the container's network setup to be created
Networks []*Network `json:"networks"`
// Routes can be specified to create entries in the route table as the container is started
Routes []*Route `json:"routes"`
// Cgroups specifies specific cgroup settings for the various subsystems that the container is
// placed into to limit the resources the container has available
Cgroups *Cgroup `json:"cgroups"`
// AppArmorProfile specifies the profile to apply to the process running in the container and is
// change at the time the process is execed
AppArmorProfile string `json:"apparmor_profile,omitempty"`
// ProcessLabel specifies the label to apply to the process running in the container. It is
// commonly used by selinux
ProcessLabel string `json:"process_label,omitempty"`
// Rlimits specifies the resource limits, such as max open files, to set in the container
// If Rlimits are not set, the container will inherit rlimits from the parent process
Rlimits []Rlimit `json:"rlimits,omitempty"`
// OomScoreAdj specifies the adjustment to be made by the kernel when calculating oom scores
// for a process. Valid values are between the range [-1000, '1000'], where processes with
// higher scores are preferred for being killed. If it is unset then we don't touch the current
// value.
// More information about kernel oom score calculation here: https://lwn.net/Articles/317814/
OomScoreAdj *int `json:"oom_score_adj,omitempty"`
// UIDMappings is an array of User ID mappings for User Namespaces
UIDMappings []IDMap `json:"uid_mappings"`
// GIDMappings is an array of Group ID mappings for User Namespaces
GIDMappings []IDMap `json:"gid_mappings"`
// MaskPaths specifies paths within the container's rootfs to mask over with a bind
// mount pointing to /dev/null as to prevent reads of the file.
MaskPaths []string `json:"mask_paths"`
// ReadonlyPaths specifies paths within the container's rootfs to remount as read-only
// so that these files prevent any writes.
ReadonlyPaths []string `json:"readonly_paths"`
// Sysctl is a map of properties and their values. It is the equivalent of using
// sysctl -w my.property.name value in Linux.
Sysctl map[string]string `json:"sysctl"`
// Seccomp allows actions to be taken whenever a syscall is made within the container.
// A number of rules are given, each having an action to be taken if a syscall matches it.
// A default action to be taken if no rules match is also given.
Seccomp *Seccomp `json:"seccomp"`
// NoNewPrivileges controls whether processes in the container can gain additional privileges.
NoNewPrivileges bool `json:"no_new_privileges,omitempty"`
// Hooks are a collection of actions to perform at various container lifecycle events.
// CommandHooks are serialized to JSON, but other hooks are not.
Hooks Hooks
// Version is the version of opencontainer specification that is supported.
Version string `json:"version"`
// Labels are user defined metadata that is stored in the config and populated on the state
Labels []string `json:"labels"`
// NoNewKeyring will not allocated a new session keyring for the container. It will use the
// callers keyring in this case.
NoNewKeyring bool `json:"no_new_keyring"`
// IntelRdt specifies settings for Intel RDT group that the container is placed into
// to limit the resources (e.g., L3 cache, memory bandwidth) the container has available
IntelRdt *IntelRdt `json:"intel_rdt,omitempty"`
// RootlessEUID is set when the runc was launched with non-zero EUID.
// Note that RootlessEUID is set to false when launched with EUID=0 in userns.
// When RootlessEUID is set, runc creates a new userns for the container.
// (config.json needs to contain userns settings)
RootlessEUID bool `json:"rootless_euid,omitempty"`
// RootlessCgroups is set when unlikely to have the full access to cgroups.
// When RootlessCgroups is set, cgroups errors are ignored.
RootlessCgroups bool `json:"rootless_cgroups,omitempty"`
// TimeOffsets specifies the offset for supporting time namespaces.
TimeOffsets map[string]specs.LinuxTimeOffset `json:"time_offsets,omitempty"`
// Scheduler represents the scheduling attributes for a process.
Scheduler *Scheduler `json:"scheduler,omitempty"`
// Personality contains configuration for the Linux personality syscall.
Personality *LinuxPersonality `json:"personality,omitempty"`
// IOPriority is the container's I/O priority.
IOPriority *IOPriority `json:"io_priority,omitempty"`
}
// Scheduler is based on the Linux sched_setattr(2) syscall.
type Scheduler = specs.Scheduler
// ToSchedAttr is to convert *configs.Scheduler to *unix.SchedAttr.
func ToSchedAttr(scheduler *Scheduler) (*unix.SchedAttr, error) {
var policy uint32
switch scheduler.Policy {
case specs.SchedOther:
policy = 0
case specs.SchedFIFO:
policy = 1
case specs.SchedRR:
policy = 2
case specs.SchedBatch:
policy = 3
case specs.SchedISO:
policy = 4
case specs.SchedIdle:
policy = 5
case specs.SchedDeadline:
policy = 6
default:
return nil, fmt.Errorf("invalid scheduler policy: %s", scheduler.Policy)
}
var flags uint64
for _, flag := range scheduler.Flags {
switch flag {
case specs.SchedFlagResetOnFork:
flags |= 0x01
case specs.SchedFlagReclaim:
flags |= 0x02
case specs.SchedFlagDLOverrun:
flags |= 0x04
case specs.SchedFlagKeepPolicy:
flags |= 0x08
case specs.SchedFlagKeepParams:
flags |= 0x10
case specs.SchedFlagUtilClampMin:
flags |= 0x20
case specs.SchedFlagUtilClampMax:
flags |= 0x40
default:
return nil, fmt.Errorf("invalid scheduler flag: %s", flag)
}
}
return &unix.SchedAttr{
Size: unix.SizeofSchedAttr,
Policy: policy,
Flags: flags,
Nice: scheduler.Nice,
Priority: uint32(scheduler.Priority),
Runtime: scheduler.Runtime,
Deadline: scheduler.Deadline,
Period: scheduler.Period,
}, nil
}
var IOPrioClassMapping = map[specs.IOPriorityClass]int{
specs.IOPRIO_CLASS_RT: 1,
specs.IOPRIO_CLASS_BE: 2,
specs.IOPRIO_CLASS_IDLE: 3,
}
type IOPriority = specs.LinuxIOPriority
type (
HookName string
HookList []Hook
Hooks map[HookName]HookList
)
const (
// Prestart commands are executed after the container namespaces are created,
// but before the user supplied command is executed from init.
// Note: This hook is now deprecated
// Prestart commands are called in the Runtime namespace.
Prestart HookName = "prestart"
// CreateRuntime commands MUST be called as part of the create operation after
// the runtime environment has been created but before the pivot_root has been executed.
// CreateRuntime is called immediately after the deprecated Prestart hook.
// CreateRuntime commands are called in the Runtime Namespace.
CreateRuntime HookName = "createRuntime"
// CreateContainer commands MUST be called as part of the create operation after
// the runtime environment has been created but before the pivot_root has been executed.
// CreateContainer commands are called in the Container namespace.
CreateContainer HookName = "createContainer"
// StartContainer commands MUST be called as part of the start operation and before
// the container process is started.
// StartContainer commands are called in the Container namespace.
StartContainer HookName = "startContainer"
// Poststart commands are executed after the container init process starts.
// Poststart commands are called in the Runtime Namespace.
Poststart HookName = "poststart"
// Poststop commands are executed after the container init process exits.
// Poststop commands are called in the Runtime Namespace.
Poststop HookName = "poststop"
)
// KnownHookNames returns the known hook names.
// Used by `runc features`.
func KnownHookNames() []string {
return []string{
string(Prestart), // deprecated
string(CreateRuntime),
string(CreateContainer),
string(StartContainer),
string(Poststart),
string(Poststop),
}
}
type Capabilities struct {
// Bounding is the set of capabilities checked by the kernel.
Bounding []string
// Effective is the set of capabilities checked by the kernel.
Effective []string
// Inheritable is the capabilities preserved across execve.
Inheritable []string
// Permitted is the limiting superset for effective capabilities.
Permitted []string
// Ambient is the ambient set of capabilities that are kept.
Ambient []string
}
// Deprecated: use (Hooks).Run instead.
func (hooks HookList) RunHooks(state *specs.State) error {
for i, h := range hooks {
if err := h.Run(state); err != nil {
return fmt.Errorf("error running hook #%d: %w", i, err)
}
}
return nil
}
func (hooks *Hooks) UnmarshalJSON(b []byte) error {
var state map[HookName][]CommandHook
if err := json.Unmarshal(b, &state); err != nil {
return err
}
*hooks = Hooks{}
for n, commandHooks := range state {
if len(commandHooks) == 0 {
continue
}
(*hooks)[n] = HookList{}
for _, h := range commandHooks {
(*hooks)[n] = append((*hooks)[n], h)
}
}
return nil
}
func (hooks *Hooks) MarshalJSON() ([]byte, error) {
serialize := func(hooks []Hook) (serializableHooks []CommandHook) {
for _, hook := range hooks {
switch chook := hook.(type) {
case CommandHook:
serializableHooks = append(serializableHooks, chook)
default:
logrus.Warnf("cannot serialize hook of type %T, skipping", hook)
}
}
return serializableHooks
}
return json.Marshal(map[string]interface{}{
"prestart": serialize((*hooks)[Prestart]),
"createRuntime": serialize((*hooks)[CreateRuntime]),
"createContainer": serialize((*hooks)[CreateContainer]),
"startContainer": serialize((*hooks)[StartContainer]),
"poststart": serialize((*hooks)[Poststart]),
"poststop": serialize((*hooks)[Poststop]),
})
}
// Run executes all hooks for the given hook name.
func (hooks Hooks) Run(name HookName, state *specs.State) error {
list := hooks[name]
for i, h := range list {
if err := h.Run(state); err != nil {
return fmt.Errorf("error running %s hook #%d: %w", name, i, err)
}
}
return nil
}
type Hook interface {
// Run executes the hook with the provided state.
Run(*specs.State) error
}
// NewFunctionHook will call the provided function when the hook is run.
func NewFunctionHook(f func(*specs.State) error) FuncHook {
return FuncHook{
run: f,
}
}
type FuncHook struct {
run func(*specs.State) error
}
func (f FuncHook) Run(s *specs.State) error {
return f.run(s)
}
type Command struct {
Path string `json:"path"`
Args []string `json:"args"`
Env []string `json:"env"`
Dir string `json:"dir"`
Timeout *time.Duration `json:"timeout"`
}
// NewCommandHook will execute the provided command when the hook is run.
func NewCommandHook(cmd Command) CommandHook {
return CommandHook{
Command: cmd,
}
}
type CommandHook struct {
Command
}
func (c Command) Run(s *specs.State) error {
b, err := json.Marshal(s)
if err != nil {
return err
}
var stdout, stderr bytes.Buffer
cmd := exec.Cmd{
Path: c.Path,
Args: c.Args,
Env: c.Env,
Stdin: bytes.NewReader(b),
Stdout: &stdout,
Stderr: &stderr,
}
if err := cmd.Start(); err != nil {
return err
}
errC := make(chan error, 1)
go func() {
err := cmd.Wait()
if err != nil {
err = fmt.Errorf("%w, stdout: %s, stderr: %s", err, stdout.String(), stderr.String())
}
errC <- err
}()
var timerCh <-chan time.Time
if c.Timeout != nil {
timer := time.NewTimer(*c.Timeout)
defer timer.Stop()
timerCh = timer.C
}
select {
case err := <-errC:
return err
case <-timerCh:
_ = cmd.Process.Kill()
<-errC
return fmt.Errorf("hook ran past specified timeout of %.1fs", c.Timeout.Seconds())
}
}

View File

@ -1,97 +0,0 @@
package configs
import (
"errors"
"fmt"
"math"
)
var (
errNoUIDMap = errors.New("user namespaces enabled, but no uid mappings found")
errNoGIDMap = errors.New("user namespaces enabled, but no gid mappings found")
)
// Please check https://man7.org/linux/man-pages/man2/personality.2.html for const details.
// https://raw.githubusercontent.com/torvalds/linux/master/include/uapi/linux/personality.h
const (
PerLinux = 0x0000
PerLinux32 = 0x0008
)
type LinuxPersonality struct {
// Domain for the personality
// can only contain values "LINUX" and "LINUX32"
Domain int `json:"domain"`
}
// HostUID gets the translated uid for the process on host which could be
// different when user namespaces are enabled.
func (c Config) HostUID(containerId int) (int, error) {
if c.Namespaces.Contains(NEWUSER) {
if len(c.UIDMappings) == 0 {
return -1, errNoUIDMap
}
id, found := c.hostIDFromMapping(int64(containerId), c.UIDMappings)
if !found {
return -1, fmt.Errorf("user namespaces enabled, but no mapping found for uid %d", containerId)
}
// If we are a 32-bit binary running on a 64-bit system, it's possible
// the mapped user is too large to store in an int, which means we
// cannot do the mapping. We can't just return an int64, because
// os.Setuid() takes an int.
if id > math.MaxInt {
return -1, fmt.Errorf("mapping for uid %d (host id %d) is larger than native integer size (%d)", containerId, id, math.MaxInt)
}
return int(id), nil
}
// Return unchanged id.
return containerId, nil
}
// HostRootUID gets the root uid for the process on host which could be non-zero
// when user namespaces are enabled.
func (c Config) HostRootUID() (int, error) {
return c.HostUID(0)
}
// HostGID gets the translated gid for the process on host which could be
// different when user namespaces are enabled.
func (c Config) HostGID(containerId int) (int, error) {
if c.Namespaces.Contains(NEWUSER) {
if len(c.GIDMappings) == 0 {
return -1, errNoGIDMap
}
id, found := c.hostIDFromMapping(int64(containerId), c.GIDMappings)
if !found {
return -1, fmt.Errorf("user namespaces enabled, but no mapping found for gid %d", containerId)
}
// If we are a 32-bit binary running on a 64-bit system, it's possible
// the mapped user is too large to store in an int, which means we
// cannot do the mapping. We can't just return an int64, because
// os.Setgid() takes an int.
if id > math.MaxInt {
return -1, fmt.Errorf("mapping for gid %d (host id %d) is larger than native integer size (%d)", containerId, id, math.MaxInt)
}
return int(id), nil
}
// Return unchanged id.
return containerId, nil
}
// HostRootGID gets the root gid for the process on host which could be non-zero
// when user namespaces are enabled.
func (c Config) HostRootGID() (int, error) {
return c.HostGID(0)
}
// Utility function that gets a host ID for a container ID from user namespace map
// if that ID is present in the map.
func (c Config) hostIDFromMapping(containerID int64, uMap []IDMap) (int64, bool) {
for _, m := range uMap {
if (containerID >= m.ContainerID) && (containerID <= (m.ContainerID + m.Size - 1)) {
hostID := m.HostID + (containerID - m.ContainerID)
return hostID, true
}
}
return -1, false
}

View File

@ -1,9 +0,0 @@
//go:build gofuzz
package configs
func FuzzUnmarshalJSON(data []byte) int {
hooks := Hooks{}
_ = hooks.UnmarshalJSON(data)
return 1
}

View File

@ -1,16 +0,0 @@
package configs
type IntelRdt struct {
// The identity for RDT Class of Service
ClosID string `json:"closID,omitempty"`
// The schema for L3 cache id and capacity bitmask (CBM)
// Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
L3CacheSchema string `json:"l3_cache_schema,omitempty"`
// The schema of memory bandwidth per L3 cache id
// Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
// The unit of memory bandwidth is specified in "percentages" by
// default, and in "MBps" if MBA Software Controller is enabled.
MemBwSchema string `json:"memBwSchema,omitempty"`
}

View File

@ -1,7 +0,0 @@
package configs
const (
// EXT_COPYUP is a directive to copy up the contents of a directory when
// a tmpfs is mounted over it.
EXT_COPYUP = 1 << iota //nolint:golint,revive // ignore "don't use ALL_CAPS" warning
)

View File

@ -1,66 +0,0 @@
package configs
import "golang.org/x/sys/unix"
type MountIDMapping struct {
// Recursive indicates if the mapping needs to be recursive.
Recursive bool `json:"recursive"`
// UserNSPath is a path to a user namespace that indicates the necessary
// id-mappings for MOUNT_ATTR_IDMAP. If set to non-"", UIDMappings and
// GIDMappings must be set to nil.
UserNSPath string `json:"userns_path,omitempty"`
// UIDMappings is the uid mapping set for this mount, to be used with
// MOUNT_ATTR_IDMAP.
UIDMappings []IDMap `json:"uid_mappings,omitempty"`
// GIDMappings is the gid mapping set for this mount, to be used with
// MOUNT_ATTR_IDMAP.
GIDMappings []IDMap `json:"gid_mappings,omitempty"`
}
type Mount struct {
// Source path for the mount.
Source string `json:"source"`
// Destination path for the mount inside the container.
Destination string `json:"destination"`
// Device the mount is for.
Device string `json:"device"`
// Mount flags.
Flags int `json:"flags"`
// Mount flags that were explicitly cleared in the configuration (meaning
// the user explicitly requested that these flags *not* be set).
ClearedFlags int `json:"cleared_flags"`
// Propagation Flags
PropagationFlags []int `json:"propagation_flags"`
// Mount data applied to the mount.
Data string `json:"data"`
// Relabel source if set, "z" indicates shared, "Z" indicates unshared.
Relabel string `json:"relabel"`
// RecAttr represents mount properties to be applied recursively (AT_RECURSIVE), see mount_setattr(2).
RecAttr *unix.MountAttr `json:"rec_attr"`
// Extensions are additional flags that are specific to runc.
Extensions int `json:"extensions"`
// Mapping is the MOUNT_ATTR_IDMAP configuration for the mount. If non-nil,
// the mount is configured to use MOUNT_ATTR_IDMAP-style id mappings.
IDMapping *MountIDMapping `json:"id_mapping,omitempty"`
}
func (m *Mount) IsBind() bool {
return m.Flags&unix.MS_BIND != 0
}
func (m *Mount) IsIDMapped() bool {
return m.IDMapping != nil
}

View File

@ -1,9 +0,0 @@
//go:build !linux
package configs
type Mount struct{}
func (m *Mount) IsBind() bool {
return false
}

View File

@ -1,5 +0,0 @@
package configs
type NamespaceType string
type Namespaces []Namespace

View File

@ -1,133 +0,0 @@
package configs
import (
"fmt"
"os"
"sync"
)
const (
NEWNET NamespaceType = "NEWNET"
NEWPID NamespaceType = "NEWPID"
NEWNS NamespaceType = "NEWNS"
NEWUTS NamespaceType = "NEWUTS"
NEWIPC NamespaceType = "NEWIPC"
NEWUSER NamespaceType = "NEWUSER"
NEWCGROUP NamespaceType = "NEWCGROUP"
NEWTIME NamespaceType = "NEWTIME"
)
var (
nsLock sync.Mutex
supportedNamespaces = make(map[NamespaceType]bool)
)
// NsName converts the namespace type to its filename
func NsName(ns NamespaceType) string {
switch ns {
case NEWNET:
return "net"
case NEWNS:
return "mnt"
case NEWPID:
return "pid"
case NEWIPC:
return "ipc"
case NEWUSER:
return "user"
case NEWUTS:
return "uts"
case NEWCGROUP:
return "cgroup"
case NEWTIME:
return "time"
}
return ""
}
// IsNamespaceSupported returns whether a namespace is available or
// not
func IsNamespaceSupported(ns NamespaceType) bool {
nsLock.Lock()
defer nsLock.Unlock()
supported, ok := supportedNamespaces[ns]
if ok {
return supported
}
nsFile := NsName(ns)
// if the namespace type is unknown, just return false
if nsFile == "" {
return false
}
// We don't need to use /proc/thread-self here because the list of
// namespace types is unrelated to the thread. This lets us avoid having to
// do runtime.LockOSThread.
_, err := os.Stat("/proc/self/ns/" + nsFile)
// a namespace is supported if it exists and we have permissions to read it
supported = err == nil
supportedNamespaces[ns] = supported
return supported
}
func NamespaceTypes() []NamespaceType {
return []NamespaceType{
NEWUSER, // Keep user NS always first, don't move it.
NEWIPC,
NEWUTS,
NEWNET,
NEWPID,
NEWNS,
NEWCGROUP,
NEWTIME,
}
}
// Namespace defines configuration for each namespace. It specifies an
// alternate path that is able to be joined via setns.
type Namespace struct {
Type NamespaceType `json:"type"`
Path string `json:"path"`
}
func (n *Namespace) GetPath(pid int) string {
return fmt.Sprintf("/proc/%d/ns/%s", pid, NsName(n.Type))
}
func (n *Namespaces) Remove(t NamespaceType) bool {
i := n.index(t)
if i == -1 {
return false
}
*n = append((*n)[:i], (*n)[i+1:]...)
return true
}
func (n *Namespaces) Add(t NamespaceType, path string) {
i := n.index(t)
if i == -1 {
*n = append(*n, Namespace{Type: t, Path: path})
return
}
(*n)[i].Path = path
}
func (n *Namespaces) index(t NamespaceType) int {
for i, ns := range *n {
if ns.Type == t {
return i
}
}
return -1
}
func (n *Namespaces) Contains(t NamespaceType) bool {
return n.index(t) != -1
}
func (n *Namespaces) PathOf(t NamespaceType) string {
i := n.index(t)
if i == -1 {
return ""
}
return (*n)[i].Path
}

View File

@ -1,45 +0,0 @@
//go:build linux
package configs
import "golang.org/x/sys/unix"
func (n *Namespace) Syscall() int {
return namespaceInfo[n.Type]
}
var namespaceInfo = map[NamespaceType]int{
NEWNET: unix.CLONE_NEWNET,
NEWNS: unix.CLONE_NEWNS,
NEWUSER: unix.CLONE_NEWUSER,
NEWIPC: unix.CLONE_NEWIPC,
NEWUTS: unix.CLONE_NEWUTS,
NEWPID: unix.CLONE_NEWPID,
NEWCGROUP: unix.CLONE_NEWCGROUP,
NEWTIME: unix.CLONE_NEWTIME,
}
// CloneFlags parses the container's Namespaces options to set the correct
// flags on clone, unshare. This function returns flags only for new namespaces.
func (n *Namespaces) CloneFlags() uintptr {
var flag int
for _, v := range *n {
if v.Path != "" {
continue
}
flag |= namespaceInfo[v.Type]
}
return uintptr(flag)
}
// IsPrivate tells whether the namespace of type t is configured as private
// (i.e. it exists and is not shared).
func (n Namespaces) IsPrivate(t NamespaceType) bool {
for _, v := range n {
if v.Type == t {
return v.Path == ""
}
}
// Not found, so implicitly sharing a parent namespace.
return false
}

View File

@ -1,13 +0,0 @@
//go:build !linux && !windows
package configs
func (n *Namespace) Syscall() int {
panic("No namespace syscall support")
}
// CloneFlags parses the container's Namespaces options to set the correct
// flags on clone, unshare. This function returns flags only for new namespaces.
func (n *Namespaces) CloneFlags() uintptr {
panic("No namespace syscall support")
}

View File

@ -1,7 +0,0 @@
//go:build !linux
package configs
// Namespace defines configuration for each namespace. It specifies an
// alternate path that is able to be joined via setns.
type Namespace struct{}

View File

@ -1,75 +0,0 @@
package configs
// Network defines configuration for a container's networking stack
//
// The network configuration can be omitted from a container causing the
// container to be setup with the host's networking stack
type Network struct {
// Type sets the networks type, commonly veth and loopback
Type string `json:"type"`
// Name of the network interface
Name string `json:"name"`
// The bridge to use.
Bridge string `json:"bridge"`
// MacAddress contains the MAC address to set on the network interface
MacAddress string `json:"mac_address"`
// Address contains the IPv4 and mask to set on the network interface
Address string `json:"address"`
// Gateway sets the gateway address that is used as the default for the interface
Gateway string `json:"gateway"`
// IPv6Address contains the IPv6 and mask to set on the network interface
IPv6Address string `json:"ipv6_address"`
// IPv6Gateway sets the ipv6 gateway address that is used as the default for the interface
IPv6Gateway string `json:"ipv6_gateway"`
// Mtu sets the mtu value for the interface and will be mirrored on both the host and
// container's interfaces if a pair is created, specifically in the case of type veth
// Note: This does not apply to loopback interfaces.
Mtu int `json:"mtu"`
// TxQueueLen sets the tx_queuelen value for the interface and will be mirrored on both the host and
// container's interfaces if a pair is created, specifically in the case of type veth
// Note: This does not apply to loopback interfaces.
TxQueueLen int `json:"txqueuelen"`
// HostInterfaceName is a unique name of a veth pair that resides on in the host interface of the
// container.
HostInterfaceName string `json:"host_interface_name"`
// HairpinMode specifies if hairpin NAT should be enabled on the virtual interface
// bridge port in the case of type veth
// Note: This is unsupported on some systems.
// Note: This does not apply to loopback interfaces.
HairpinMode bool `json:"hairpin_mode"`
}
// Route defines a routing table entry.
//
// Routes can be specified to create entries in the routing table as the container
// is started.
//
// All of destination, source, and gateway should be either IPv4 or IPv6.
// One of the three options must be present, and omitted entries will use their
// IP family default for the route table. For IPv4 for example, setting the
// gateway to 1.2.3.4 and the interface to eth0 will set up a standard
// destination of 0.0.0.0(or *) when viewed in the route table.
type Route struct {
// Destination specifies the destination IP address and mask in the CIDR form.
Destination string `json:"destination"`
// Source specifies the source IP address and mask in the CIDR form.
Source string `json:"source"`
// Gateway specifies the gateway IP address.
Gateway string `json:"gateway"`
// InterfaceName specifies the device to set this route up for, for example eth0.
InterfaceName string `json:"interface_name"`
}

View File

@ -1,119 +0,0 @@
//go:build !windows
package devices
import (
"errors"
"os"
"path/filepath"
"golang.org/x/sys/unix"
)
// ErrNotADevice denotes that a file is not a valid linux device.
var ErrNotADevice = errors.New("not a device node")
// Testing dependencies
var (
unixLstat = unix.Lstat
osReadDir = os.ReadDir
)
func mkDev(d *Rule) (uint64, error) {
if d.Major == Wildcard || d.Minor == Wildcard {
return 0, errors.New("cannot mkdev() device with wildcards")
}
return unix.Mkdev(uint32(d.Major), uint32(d.Minor)), nil
}
// DeviceFromPath takes the path to a device and its cgroup_permissions (which
// cannot be easily queried) to look up the information about a linux device
// and returns that information as a Device struct.
func DeviceFromPath(path, permissions string) (*Device, error) {
var stat unix.Stat_t
err := unixLstat(path, &stat)
if err != nil {
return nil, err
}
var (
devType Type
mode = stat.Mode
devNumber = uint64(stat.Rdev) //nolint:unconvert // Rdev is uint32 on e.g. MIPS.
major = unix.Major(devNumber)
minor = unix.Minor(devNumber)
)
switch mode & unix.S_IFMT {
case unix.S_IFBLK:
devType = BlockDevice
case unix.S_IFCHR:
devType = CharDevice
case unix.S_IFIFO:
devType = FifoDevice
default:
return nil, ErrNotADevice
}
return &Device{
Rule: Rule{
Type: devType,
Major: int64(major),
Minor: int64(minor),
Permissions: Permissions(permissions),
},
Path: path,
FileMode: os.FileMode(mode &^ unix.S_IFMT),
Uid: stat.Uid,
Gid: stat.Gid,
}, nil
}
// HostDevices returns all devices that can be found under /dev directory.
func HostDevices() ([]*Device, error) {
return GetDevices("/dev")
}
// GetDevices recursively traverses a directory specified by path
// and returns all devices found there.
func GetDevices(path string) ([]*Device, error) {
files, err := osReadDir(path)
if err != nil {
return nil, err
}
var out []*Device
for _, f := range files {
switch {
case f.IsDir():
switch f.Name() {
// ".lxc" & ".lxd-mounts" added to address https://github.com/lxc/lxd/issues/2825
// ".udev" added to address https://github.com/opencontainers/runc/issues/2093
case "pts", "shm", "fd", "mqueue", ".lxc", ".lxd-mounts", ".udev":
continue
default:
sub, err := GetDevices(filepath.Join(path, f.Name()))
if err != nil {
return nil, err
}
out = append(out, sub...)
continue
}
case f.Name() == "console":
continue
}
device, err := DeviceFromPath(filepath.Join(path, f.Name()), "rwm")
if err != nil {
if errors.Is(err, ErrNotADevice) {
continue
}
if os.IsNotExist(err) {
continue
}
return nil, err
}
if device.Type == FifoDevice {
continue
}
out = append(out, device)
}
return out, nil
}

View File

@ -1,23 +0,0 @@
package intelrdt
var cmtEnabled bool
// Check if Intel RDT/CMT is enabled.
func IsCMTEnabled() bool {
featuresInit()
return cmtEnabled
}
func getCMTNumaNodeStats(numaPath string) (*CMTNumaNodeStats, error) {
stats := &CMTNumaNodeStats{}
if enabledMonFeatures.llcOccupancy {
llcOccupancy, err := getIntelRdtParamUint(numaPath, "llc_occupancy")
if err != nil {
return nil, err
}
stats.LLCOccupancy = llcOccupancy
}
return stats, nil
}

View File

@ -1,681 +0,0 @@
package intelrdt
import (
"bytes"
"errors"
"fmt"
"os"
"path/filepath"
"strconv"
"strings"
"sync"
"github.com/moby/sys/mountinfo"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
)
/*
* About Intel RDT features:
* Intel platforms with new Xeon CPU support Resource Director Technology (RDT).
* Cache Allocation Technology (CAT) and Memory Bandwidth Allocation (MBA) are
* two sub-features of RDT.
*
* Cache Allocation Technology (CAT) provides a way for the software to restrict
* cache allocation to a defined 'subset' of L3 cache which may be overlapping
* with other 'subsets'. The different subsets are identified by class of
* service (CLOS) and each CLOS has a capacity bitmask (CBM).
*
* Memory Bandwidth Allocation (MBA) provides indirect and approximate throttle
* over memory bandwidth for the software. A user controls the resource by
* indicating the percentage of maximum memory bandwidth or memory bandwidth
* limit in MBps unit if MBA Software Controller is enabled.
*
* More details about Intel RDT CAT and MBA can be found in the section 17.18
* of Intel Software Developer Manual:
* https://software.intel.com/en-us/articles/intel-sdm
*
* About Intel RDT kernel interface:
* In Linux 4.10 kernel or newer, the interface is defined and exposed via
* "resource control" filesystem, which is a "cgroup-like" interface.
*
* Comparing with cgroups, it has similar process management lifecycle and
* interfaces in a container. But unlike cgroups' hierarchy, it has single level
* filesystem layout.
*
* CAT and MBA features are introduced in Linux 4.10 and 4.12 kernel via
* "resource control" filesystem.
*
* Intel RDT "resource control" filesystem hierarchy:
* mount -t resctrl resctrl /sys/fs/resctrl
* tree /sys/fs/resctrl
* /sys/fs/resctrl/
* |-- info
* | |-- L3
* | | |-- cbm_mask
* | | |-- min_cbm_bits
* | | |-- num_closids
* | |-- L3_MON
* | | |-- max_threshold_occupancy
* | | |-- mon_features
* | | |-- num_rmids
* | |-- MB
* | |-- bandwidth_gran
* | |-- delay_linear
* | |-- min_bandwidth
* | |-- num_closids
* |-- ...
* |-- schemata
* |-- tasks
* |-- <clos>
* |-- ...
* |-- schemata
* |-- tasks
*
* For runc, we can make use of `tasks` and `schemata` configuration for L3
* cache and memory bandwidth resources constraints.
*
* The file `tasks` has a list of tasks that belongs to this group (e.g.,
* <container_id>" group). Tasks can be added to a group by writing the task ID
* to the "tasks" file (which will automatically remove them from the previous
* group to which they belonged). New tasks created by fork(2) and clone(2) are
* added to the same group as their parent.
*
* The file `schemata` has a list of all the resources available to this group.
* Each resource (L3 cache, memory bandwidth) has its own line and format.
*
* L3 cache schema:
* It has allocation bitmasks/values for L3 cache on each socket, which
* contains L3 cache id and capacity bitmask (CBM).
* Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
* For example, on a two-socket machine, the schema line could be "L3:0=ff;1=c0"
* which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
*
* The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
* be set is less than the max bit. The max bits in the CBM is varied among
* supported Intel CPU models. Kernel will check if it is valid when writing.
* e.g., default value 0xfffff in root indicates the max bits of CBM is 20
* bits, which mapping to entire L3 cache capacity. Some valid CBM values to
* set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
*
* Memory bandwidth schema:
* It has allocation values for memory bandwidth on each socket, which contains
* L3 cache id and memory bandwidth.
* Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
* For example, on a two-socket machine, the schema line could be "MB:0=20;1=70"
*
* The minimum bandwidth percentage value for each CPU model is predefined and
* can be looked up through "info/MB/min_bandwidth". The bandwidth granularity
* that is allocated is also dependent on the CPU model and can be looked up at
* "info/MB/bandwidth_gran". The available bandwidth control steps are:
* min_bw + N * bw_gran. Intermediate values are rounded to the next control
* step available on the hardware.
*
* If MBA Software Controller is enabled through mount option "-o mba_MBps":
* mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
* We could specify memory bandwidth in "MBps" (Mega Bytes per second) unit
* instead of "percentages". The kernel underneath would use a software feedback
* mechanism or a "Software Controller" which reads the actual bandwidth using
* MBM counters and adjust the memory bandwidth percentages to ensure:
* "actual memory bandwidth < user specified memory bandwidth".
*
* For example, on a two-socket machine, the schema line could be
* "MB:0=5000;1=7000" which means 5000 MBps memory bandwidth limit on socket 0
* and 7000 MBps memory bandwidth limit on socket 1.
*
* For more information about Intel RDT kernel interface:
* https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
*
* An example for runc:
* Consider a two-socket machine with two L3 caches where the default CBM is
* 0x7ff and the max CBM length is 11 bits, and minimum memory bandwidth of 10%
* with a memory bandwidth granularity of 10%.
*
* Tasks inside the container only have access to the "upper" 7/11 of L3 cache
* on socket 0 and the "lower" 5/11 L3 cache on socket 1, and may use a
* maximum memory bandwidth of 20% on socket 0 and 70% on socket 1.
*
* "linux": {
* "intelRdt": {
* "l3CacheSchema": "L3:0=7f0;1=1f",
* "memBwSchema": "MB:0=20;1=70"
* }
* }
*/
type Manager struct {
mu sync.Mutex
config *configs.Config
id string
path string
}
// NewManager returns a new instance of Manager, or nil if the Intel RDT
// functionality is not specified in the config, available from hardware or
// enabled in the kernel.
func NewManager(config *configs.Config, id string, path string) *Manager {
if config.IntelRdt == nil {
return nil
}
if _, err := Root(); err != nil {
// Intel RDT is not available.
return nil
}
return newManager(config, id, path)
}
// newManager is the same as NewManager, except it does not check if the feature
// is actually available. Used by unit tests that mock intelrdt paths.
func newManager(config *configs.Config, id string, path string) *Manager {
return &Manager{
config: config,
id: id,
path: path,
}
}
const (
intelRdtTasks = "tasks"
)
var (
// The flag to indicate if Intel RDT/CAT is enabled
catEnabled bool
// The flag to indicate if Intel RDT/MBA is enabled
mbaEnabled bool
// For Intel RDT initialization
initOnce sync.Once
errNotFound = errors.New("Intel RDT not available")
)
// Check if Intel RDT sub-features are enabled in featuresInit()
func featuresInit() {
initOnce.Do(func() {
// 1. Check if Intel RDT "resource control" filesystem is available.
// The user guarantees to mount the filesystem.
root, err := Root()
if err != nil {
return
}
// 2. Check if Intel RDT sub-features are available in "resource
// control" filesystem. Intel RDT sub-features can be
// selectively disabled or enabled by kernel command line
// (e.g., rdt=!l3cat,mba) in 4.14 and newer kernel
if _, err := os.Stat(filepath.Join(root, "info", "L3")); err == nil {
catEnabled = true
}
if _, err := os.Stat(filepath.Join(root, "info", "MB")); err == nil {
mbaEnabled = true
}
if _, err := os.Stat(filepath.Join(root, "info", "L3_MON")); err != nil {
return
}
enabledMonFeatures, err = getMonFeatures(root)
if err != nil {
return
}
if enabledMonFeatures.mbmTotalBytes || enabledMonFeatures.mbmLocalBytes {
mbmEnabled = true
}
if enabledMonFeatures.llcOccupancy {
cmtEnabled = true
}
})
}
// findIntelRdtMountpointDir returns the mount point of the Intel RDT "resource control" filesystem.
func findIntelRdtMountpointDir() (string, error) {
mi, err := mountinfo.GetMounts(func(m *mountinfo.Info) (bool, bool) {
// similar to mountinfo.FSTypeFilter but stops after the first match
if m.FSType == "resctrl" {
return false, true // don't skip, stop
}
return true, false // skip, keep going
})
if err != nil {
return "", err
}
if len(mi) < 1 {
return "", errNotFound
}
return mi[0].Mountpoint, nil
}
// For Root() use only.
var (
intelRdtRoot string
intelRdtRootErr error
rootOnce sync.Once
)
// The kernel creates this (empty) directory if resctrl is supported by the
// hardware and kernel. The user is responsible for mounting the resctrl
// filesystem, and they could mount it somewhere else if they wanted to.
const defaultResctrlMountpoint = "/sys/fs/resctrl"
// Root returns the Intel RDT "resource control" filesystem mount point.
func Root() (string, error) {
rootOnce.Do(func() {
// Does this system support resctrl?
var statfs unix.Statfs_t
if err := unix.Statfs(defaultResctrlMountpoint, &statfs); err != nil {
if errors.Is(err, unix.ENOENT) {
err = errNotFound
}
intelRdtRootErr = err
return
}
// Has the resctrl fs been mounted to the default mount point?
if statfs.Type == unix.RDTGROUP_SUPER_MAGIC {
intelRdtRoot = defaultResctrlMountpoint
return
}
// The resctrl fs could have been mounted somewhere nonstandard.
intelRdtRoot, intelRdtRootErr = findIntelRdtMountpointDir()
})
return intelRdtRoot, intelRdtRootErr
}
// Gets a single uint64 value from the specified file.
func getIntelRdtParamUint(path, file string) (uint64, error) {
fileName := filepath.Join(path, file)
contents, err := os.ReadFile(fileName)
if err != nil {
return 0, err
}
res, err := fscommon.ParseUint(string(bytes.TrimSpace(contents)), 10, 64)
if err != nil {
return res, fmt.Errorf("unable to parse %q as a uint from file %q", string(contents), fileName)
}
return res, nil
}
// Gets a string value from the specified file
func getIntelRdtParamString(path, file string) (string, error) {
contents, err := os.ReadFile(filepath.Join(path, file))
if err != nil {
return "", err
}
return string(bytes.TrimSpace(contents)), nil
}
func writeFile(dir, file, data string) error {
if dir == "" {
return fmt.Errorf("no such directory for %s", file)
}
if err := os.WriteFile(filepath.Join(dir, file), []byte(data+"\n"), 0o600); err != nil {
return newLastCmdError(fmt.Errorf("intelrdt: unable to write %v: %w", data, err))
}
return nil
}
// Get the read-only L3 cache information
func getL3CacheInfo() (*L3CacheInfo, error) {
l3CacheInfo := &L3CacheInfo{}
rootPath, err := Root()
if err != nil {
return l3CacheInfo, err
}
path := filepath.Join(rootPath, "info", "L3")
cbmMask, err := getIntelRdtParamString(path, "cbm_mask")
if err != nil {
return l3CacheInfo, err
}
minCbmBits, err := getIntelRdtParamUint(path, "min_cbm_bits")
if err != nil {
return l3CacheInfo, err
}
numClosids, err := getIntelRdtParamUint(path, "num_closids")
if err != nil {
return l3CacheInfo, err
}
l3CacheInfo.CbmMask = cbmMask
l3CacheInfo.MinCbmBits = minCbmBits
l3CacheInfo.NumClosids = numClosids
return l3CacheInfo, nil
}
// Get the read-only memory bandwidth information
func getMemBwInfo() (*MemBwInfo, error) {
memBwInfo := &MemBwInfo{}
rootPath, err := Root()
if err != nil {
return memBwInfo, err
}
path := filepath.Join(rootPath, "info", "MB")
bandwidthGran, err := getIntelRdtParamUint(path, "bandwidth_gran")
if err != nil {
return memBwInfo, err
}
delayLinear, err := getIntelRdtParamUint(path, "delay_linear")
if err != nil {
return memBwInfo, err
}
minBandwidth, err := getIntelRdtParamUint(path, "min_bandwidth")
if err != nil {
return memBwInfo, err
}
numClosids, err := getIntelRdtParamUint(path, "num_closids")
if err != nil {
return memBwInfo, err
}
memBwInfo.BandwidthGran = bandwidthGran
memBwInfo.DelayLinear = delayLinear
memBwInfo.MinBandwidth = minBandwidth
memBwInfo.NumClosids = numClosids
return memBwInfo, nil
}
// Get diagnostics for last filesystem operation error from file info/last_cmd_status
func getLastCmdStatus() (string, error) {
rootPath, err := Root()
if err != nil {
return "", err
}
path := filepath.Join(rootPath, "info")
lastCmdStatus, err := getIntelRdtParamString(path, "last_cmd_status")
if err != nil {
return "", err
}
return lastCmdStatus, nil
}
// WriteIntelRdtTasks writes the specified pid into the "tasks" file
func WriteIntelRdtTasks(dir string, pid int) error {
if dir == "" {
return fmt.Errorf("no such directory for %s", intelRdtTasks)
}
// Don't attach any pid if -1 is specified as a pid
if pid != -1 {
if err := os.WriteFile(filepath.Join(dir, intelRdtTasks), []byte(strconv.Itoa(pid)), 0o600); err != nil {
return newLastCmdError(fmt.Errorf("intelrdt: unable to add pid %d: %w", pid, err))
}
}
return nil
}
// Check if Intel RDT/CAT is enabled
func IsCATEnabled() bool {
featuresInit()
return catEnabled
}
// Check if Intel RDT/MBA is enabled
func IsMBAEnabled() bool {
featuresInit()
return mbaEnabled
}
// Get the path of the clos group in "resource control" filesystem that the container belongs to
func (m *Manager) getIntelRdtPath() (string, error) {
rootPath, err := Root()
if err != nil {
return "", err
}
clos := m.id
if m.config.IntelRdt != nil && m.config.IntelRdt.ClosID != "" {
clos = m.config.IntelRdt.ClosID
}
return filepath.Join(rootPath, clos), nil
}
// Applies Intel RDT configuration to the process with the specified pid
func (m *Manager) Apply(pid int) (err error) {
// If intelRdt is not specified in config, we do nothing
if m.config.IntelRdt == nil {
return nil
}
path, err := m.getIntelRdtPath()
if err != nil {
return err
}
m.mu.Lock()
defer m.mu.Unlock()
if m.config.IntelRdt.ClosID != "" && m.config.IntelRdt.L3CacheSchema == "" && m.config.IntelRdt.MemBwSchema == "" {
// Check that the CLOS exists, i.e. it has been pre-configured to
// conform with the runtime spec
if _, err := os.Stat(path); err != nil {
return fmt.Errorf("clos dir not accessible (must be pre-created when l3CacheSchema and memBwSchema are empty): %w", err)
}
}
if err := os.MkdirAll(path, 0o755); err != nil {
return newLastCmdError(err)
}
if err := WriteIntelRdtTasks(path, pid); err != nil {
return newLastCmdError(err)
}
m.path = path
return nil
}
// Destroys the Intel RDT container-specific 'container_id' group
func (m *Manager) Destroy() error {
// Don't remove resctrl group if closid has been explicitly specified. The
// group is likely externally managed, i.e. by some other entity than us.
// There are probably other containers/tasks sharing the same group.
if m.config.IntelRdt != nil && m.config.IntelRdt.ClosID == "" {
m.mu.Lock()
defer m.mu.Unlock()
if err := os.RemoveAll(m.GetPath()); err != nil {
return err
}
m.path = ""
}
return nil
}
// Returns Intel RDT path to save in a state file and to be able to
// restore the object later
func (m *Manager) GetPath() string {
if m.path == "" {
m.path, _ = m.getIntelRdtPath()
}
return m.path
}
// Returns statistics for Intel RDT
func (m *Manager) GetStats() (*Stats, error) {
// If intelRdt is not specified in config
if m.config.IntelRdt == nil {
return nil, nil
}
m.mu.Lock()
defer m.mu.Unlock()
stats := newStats()
rootPath, err := Root()
if err != nil {
return nil, err
}
// The read-only L3 cache and memory bandwidth schemata in root
tmpRootStrings, err := getIntelRdtParamString(rootPath, "schemata")
if err != nil {
return nil, err
}
schemaRootStrings := strings.Split(tmpRootStrings, "\n")
// The L3 cache and memory bandwidth schemata in container's clos group
containerPath := m.GetPath()
tmpStrings, err := getIntelRdtParamString(containerPath, "schemata")
if err != nil {
return nil, err
}
schemaStrings := strings.Split(tmpStrings, "\n")
if IsCATEnabled() {
// The read-only L3 cache information
l3CacheInfo, err := getL3CacheInfo()
if err != nil {
return nil, err
}
stats.L3CacheInfo = l3CacheInfo
// The read-only L3 cache schema in root
for _, schemaRoot := range schemaRootStrings {
if strings.Contains(schemaRoot, "L3") {
stats.L3CacheSchemaRoot = strings.TrimSpace(schemaRoot)
}
}
// The L3 cache schema in container's clos group
for _, schema := range schemaStrings {
if strings.Contains(schema, "L3") {
stats.L3CacheSchema = strings.TrimSpace(schema)
}
}
}
if IsMBAEnabled() {
// The read-only memory bandwidth information
memBwInfo, err := getMemBwInfo()
if err != nil {
return nil, err
}
stats.MemBwInfo = memBwInfo
// The read-only memory bandwidth information
for _, schemaRoot := range schemaRootStrings {
if strings.Contains(schemaRoot, "MB") {
stats.MemBwSchemaRoot = strings.TrimSpace(schemaRoot)
}
}
// The memory bandwidth schema in container's clos group
for _, schema := range schemaStrings {
if strings.Contains(schema, "MB") {
stats.MemBwSchema = strings.TrimSpace(schema)
}
}
}
if IsMBMEnabled() || IsCMTEnabled() {
err = getMonitoringStats(containerPath, stats)
if err != nil {
return nil, err
}
}
return stats, nil
}
// Set Intel RDT "resource control" filesystem as configured.
func (m *Manager) Set(container *configs.Config) error {
// About L3 cache schema:
// It has allocation bitmasks/values for L3 cache on each socket,
// which contains L3 cache id and capacity bitmask (CBM).
// Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
// For example, on a two-socket machine, the schema line could be:
// L3:0=ff;1=c0
// which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM
// is 0xc0.
//
// The valid L3 cache CBM is a *contiguous bits set* and number of
// bits that can be set is less than the max bit. The max bits in the
// CBM is varied among supported Intel CPU models. Kernel will check
// if it is valid when writing. e.g., default value 0xfffff in root
// indicates the max bits of CBM is 20 bits, which mapping to entire
// L3 cache capacity. Some valid CBM values to set in a group:
// 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
//
//
// About memory bandwidth schema:
// It has allocation values for memory bandwidth on each socket, which
// contains L3 cache id and memory bandwidth.
// Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
// For example, on a two-socket machine, the schema line could be:
// "MB:0=20;1=70"
//
// The minimum bandwidth percentage value for each CPU model is
// predefined and can be looked up through "info/MB/min_bandwidth".
// The bandwidth granularity that is allocated is also dependent on
// the CPU model and can be looked up at "info/MB/bandwidth_gran".
// The available bandwidth control steps are: min_bw + N * bw_gran.
// Intermediate values are rounded to the next control step available
// on the hardware.
//
// If MBA Software Controller is enabled through mount option
// "-o mba_MBps": mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
// We could specify memory bandwidth in "MBps" (Mega Bytes per second)
// unit instead of "percentages". The kernel underneath would use a
// software feedback mechanism or a "Software Controller" which reads
// the actual bandwidth using MBM counters and adjust the memory
// bandwidth percentages to ensure:
// "actual memory bandwidth < user specified memory bandwidth".
//
// For example, on a two-socket machine, the schema line could be
// "MB:0=5000;1=7000" which means 5000 MBps memory bandwidth limit on
// socket 0 and 7000 MBps memory bandwidth limit on socket 1.
if container.IntelRdt != nil {
path := m.GetPath()
l3CacheSchema := container.IntelRdt.L3CacheSchema
memBwSchema := container.IntelRdt.MemBwSchema
// TODO: verify that l3CacheSchema and/or memBwSchema match the
// existing schemata if ClosID has been specified. This is a more
// involved than reading the file and doing plain string comparison as
// the value written in does not necessarily match what gets read out
// (leading zeros, cache id ordering etc).
// Write a single joint schema string to schemata file
if l3CacheSchema != "" && memBwSchema != "" {
if err := writeFile(path, "schemata", l3CacheSchema+"\n"+memBwSchema); err != nil {
return err
}
}
// Write only L3 cache schema string to schemata file
if l3CacheSchema != "" && memBwSchema == "" {
if err := writeFile(path, "schemata", l3CacheSchema); err != nil {
return err
}
}
// Write only memory bandwidth schema string to schemata file
if l3CacheSchema == "" && memBwSchema != "" {
if err := writeFile(path, "schemata", memBwSchema); err != nil {
return err
}
}
}
return nil
}
func newLastCmdError(err error) error {
status, err1 := getLastCmdStatus()
if err1 == nil {
return fmt.Errorf("%w, last_cmd_status: %s", err, status)
}
return err
}

View File

@ -1,31 +0,0 @@
package intelrdt
// The flag to indicate if Intel RDT/MBM is enabled
var mbmEnabled bool
// Check if Intel RDT/MBM is enabled.
func IsMBMEnabled() bool {
featuresInit()
return mbmEnabled
}
func getMBMNumaNodeStats(numaPath string) (*MBMNumaNodeStats, error) {
stats := &MBMNumaNodeStats{}
if enabledMonFeatures.mbmTotalBytes {
mbmTotalBytes, err := getIntelRdtParamUint(numaPath, "mbm_total_bytes")
if err != nil {
return nil, err
}
stats.MBMTotalBytes = mbmTotalBytes
}
if enabledMonFeatures.mbmLocalBytes {
mbmLocalBytes, err := getIntelRdtParamUint(numaPath, "mbm_local_bytes")
if err != nil {
return nil, err
}
stats.MBMLocalBytes = mbmLocalBytes
}
return stats, nil
}

View File

@ -1,83 +0,0 @@
package intelrdt
import (
"bufio"
"io"
"os"
"path/filepath"
"github.com/sirupsen/logrus"
)
var enabledMonFeatures monFeatures
type monFeatures struct {
mbmTotalBytes bool
mbmLocalBytes bool
llcOccupancy bool
}
func getMonFeatures(intelRdtRoot string) (monFeatures, error) {
file, err := os.Open(filepath.Join(intelRdtRoot, "info", "L3_MON", "mon_features"))
if err != nil {
return monFeatures{}, err
}
defer file.Close()
return parseMonFeatures(file)
}
func parseMonFeatures(reader io.Reader) (monFeatures, error) {
scanner := bufio.NewScanner(reader)
monFeatures := monFeatures{}
for scanner.Scan() {
switch feature := scanner.Text(); feature {
case "mbm_total_bytes":
monFeatures.mbmTotalBytes = true
case "mbm_local_bytes":
monFeatures.mbmLocalBytes = true
case "llc_occupancy":
monFeatures.llcOccupancy = true
default:
logrus.Warnf("Unsupported Intel RDT monitoring feature: %s", feature)
}
}
return monFeatures, scanner.Err()
}
func getMonitoringStats(containerPath string, stats *Stats) error {
numaFiles, err := os.ReadDir(filepath.Join(containerPath, "mon_data"))
if err != nil {
return err
}
var mbmStats []MBMNumaNodeStats
var cmtStats []CMTNumaNodeStats
for _, file := range numaFiles {
if file.IsDir() {
numaPath := filepath.Join(containerPath, "mon_data", file.Name())
if IsMBMEnabled() {
numaMBMStats, err := getMBMNumaNodeStats(numaPath)
if err != nil {
return err
}
mbmStats = append(mbmStats, *numaMBMStats)
}
if IsCMTEnabled() {
numaCMTStats, err := getCMTNumaNodeStats(numaPath)
if err != nil {
return err
}
cmtStats = append(cmtStats, *numaCMTStats)
}
}
}
stats.MBMStats = &mbmStats
stats.CMTStats = &cmtStats
return err
}

View File

@ -1,57 +0,0 @@
package intelrdt
type L3CacheInfo struct {
CbmMask string `json:"cbm_mask,omitempty"`
MinCbmBits uint64 `json:"min_cbm_bits,omitempty"`
NumClosids uint64 `json:"num_closids,omitempty"`
}
type MemBwInfo struct {
BandwidthGran uint64 `json:"bandwidth_gran,omitempty"`
DelayLinear uint64 `json:"delay_linear,omitempty"`
MinBandwidth uint64 `json:"min_bandwidth,omitempty"`
NumClosids uint64 `json:"num_closids,omitempty"`
}
type MBMNumaNodeStats struct {
// The 'mbm_total_bytes' in 'container_id' group.
MBMTotalBytes uint64 `json:"mbm_total_bytes"`
// The 'mbm_local_bytes' in 'container_id' group.
MBMLocalBytes uint64 `json:"mbm_local_bytes"`
}
type CMTNumaNodeStats struct {
// The 'llc_occupancy' in 'container_id' group.
LLCOccupancy uint64 `json:"llc_occupancy"`
}
type Stats struct {
// The read-only L3 cache information
L3CacheInfo *L3CacheInfo `json:"l3_cache_info,omitempty"`
// The read-only L3 cache schema in root
L3CacheSchemaRoot string `json:"l3_cache_schema_root,omitempty"`
// The L3 cache schema in 'container_id' group
L3CacheSchema string `json:"l3_cache_schema,omitempty"`
// The read-only memory bandwidth information
MemBwInfo *MemBwInfo `json:"mem_bw_info,omitempty"`
// The read-only memory bandwidth schema in root
MemBwSchemaRoot string `json:"mem_bw_schema_root,omitempty"`
// The memory bandwidth schema in 'container_id' group
MemBwSchema string `json:"mem_bw_schema,omitempty"`
// The memory bandwidth monitoring statistics from NUMA nodes in 'container_id' group
MBMStats *[]MBMNumaNodeStats `json:"mbm_stats,omitempty"`
// The cache monitoring technology statistics from NUMA nodes in 'container_id' group
CMTStats *[]CMTNumaNodeStats `json:"cmt_stats,omitempty"`
}
func newStats() *Stats {
return &Stats{}
}

View File

@ -1,119 +0,0 @@
package utils
/*
* Copyright 2016, 2017 SUSE LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import (
"fmt"
"os"
"runtime"
"golang.org/x/sys/unix"
)
// MaxNameLen is the maximum length of the name of a file descriptor being sent
// using SendFile. The name of the file handle returned by RecvFile will never be
// larger than this value.
const MaxNameLen = 4096
// oobSpace is the size of the oob slice required to store a single FD. Note
// that unix.UnixRights appears to make the assumption that fd is always int32,
// so sizeof(fd) = 4.
var oobSpace = unix.CmsgSpace(4)
// RecvFile waits for a file descriptor to be sent over the given AF_UNIX
// socket. The file name of the remote file descriptor will be recreated
// locally (it is sent as non-auxiliary data in the same payload).
func RecvFile(socket *os.File) (_ *os.File, Err error) {
name := make([]byte, MaxNameLen)
oob := make([]byte, oobSpace)
sockfd := socket.Fd()
n, oobn, _, _, err := unix.Recvmsg(int(sockfd), name, oob, unix.MSG_CMSG_CLOEXEC)
if err != nil {
return nil, err
}
if n >= MaxNameLen || oobn != oobSpace {
return nil, fmt.Errorf("recvfile: incorrect number of bytes read (n=%d oobn=%d)", n, oobn)
}
// Truncate.
name = name[:n]
oob = oob[:oobn]
scms, err := unix.ParseSocketControlMessage(oob)
if err != nil {
return nil, err
}
// We cannot control how many SCM_RIGHTS we receive, and upon receiving
// them all of the descriptors are installed in our fd table, so we need to
// parse all of the SCM_RIGHTS we received in order to close all of the
// descriptors on error.
var fds []int
defer func() {
for i, fd := range fds {
if i == 0 && Err == nil {
// Only close the first one on error.
continue
}
// Always close extra ones.
_ = unix.Close(fd)
}
}()
var lastErr error
for _, scm := range scms {
if scm.Header.Type == unix.SCM_RIGHTS {
scmFds, err := unix.ParseUnixRights(&scm)
if err != nil {
lastErr = err
} else {
fds = append(fds, scmFds...)
}
}
}
if lastErr != nil {
return nil, lastErr
}
// We do this after collecting the fds to make sure we close them all when
// returning an error here.
if len(scms) != 1 {
return nil, fmt.Errorf("recvfd: number of SCMs is not 1: %d", len(scms))
}
if len(fds) != 1 {
return nil, fmt.Errorf("recvfd: number of fds is not 1: %d", len(fds))
}
return os.NewFile(uintptr(fds[0]), string(name)), nil
}
// SendFile sends a file over the given AF_UNIX socket. file.Name() is also
// included so that if the other end uses RecvFile, the file will have the same
// name information.
func SendFile(socket *os.File, file *os.File) error {
name := file.Name()
if len(name) >= MaxNameLen {
return fmt.Errorf("sendfd: filename too long: %s", name)
}
err := SendRawFd(socket, name, file.Fd())
runtime.KeepAlive(file)
return err
}
// SendRawFd sends a specific file descriptor over the given AF_UNIX socket.
func SendRawFd(socket *os.File, msg string, fd uintptr) error {
oob := unix.UnixRights(int(fd))
return unix.Sendmsg(int(socket.Fd()), []byte(msg), oob, nil, 0)
}

View File

@ -1,115 +0,0 @@
package utils
import (
"encoding/json"
"io"
"os"
"path/filepath"
"strings"
"golang.org/x/sys/unix"
)
const (
exitSignalOffset = 128
)
// ExitStatus returns the correct exit status for a process based on if it
// was signaled or exited cleanly
func ExitStatus(status unix.WaitStatus) int {
if status.Signaled() {
return exitSignalOffset + int(status.Signal())
}
return status.ExitStatus()
}
// WriteJSON writes the provided struct v to w using standard json marshaling
// without a trailing newline. This is used instead of json.Encoder because
// there might be a problem in json decoder in some cases, see:
// https://github.com/docker/docker/issues/14203#issuecomment-174177790
func WriteJSON(w io.Writer, v interface{}) error {
data, err := json.Marshal(v)
if err != nil {
return err
}
_, err = w.Write(data)
return err
}
// CleanPath makes a path safe for use with filepath.Join. This is done by not
// only cleaning the path, but also (if the path is relative) adding a leading
// '/' and cleaning it (then removing the leading '/'). This ensures that a
// path resulting from prepending another path will always resolve to lexically
// be a subdirectory of the prefixed path. This is all done lexically, so paths
// that include symlinks won't be safe as a result of using CleanPath.
func CleanPath(path string) string {
// Deal with empty strings nicely.
if path == "" {
return ""
}
// Ensure that all paths are cleaned (especially problematic ones like
// "/../../../../../" which can cause lots of issues).
path = filepath.Clean(path)
// If the path isn't absolute, we need to do more processing to fix paths
// such as "../../../../<etc>/some/path". We also shouldn't convert absolute
// paths to relative ones.
if !filepath.IsAbs(path) {
path = filepath.Clean(string(os.PathSeparator) + path)
// This can't fail, as (by definition) all paths are relative to root.
path, _ = filepath.Rel(string(os.PathSeparator), path)
}
// Clean the path again for good measure.
return filepath.Clean(path)
}
// stripRoot returns the passed path, stripping the root path if it was
// (lexicially) inside it. Note that both passed paths will always be treated
// as absolute, and the returned path will also always be absolute. In
// addition, the paths are cleaned before stripping the root.
func stripRoot(root, path string) string {
// Make the paths clean and absolute.
root, path = CleanPath("/"+root), CleanPath("/"+path)
switch {
case path == root:
path = "/"
case root == "/":
// do nothing
case strings.HasPrefix(path, root+"/"):
path = strings.TrimPrefix(path, root+"/")
}
return CleanPath("/" + path)
}
// SearchLabels searches through a list of key=value pairs for a given key,
// returning its value, and the binary flag telling whether the key exist.
func SearchLabels(labels []string, key string) (string, bool) {
key += "="
for _, s := range labels {
if strings.HasPrefix(s, key) {
return s[len(key):], true
}
}
return "", false
}
// Annotations returns the bundle path and user defined annotations from the
// libcontainer state. We need to remove the bundle because that is a label
// added by libcontainer.
func Annotations(labels []string) (bundle string, userAnnotations map[string]string) {
userAnnotations = make(map[string]string)
for _, l := range labels {
name, value, ok := strings.Cut(l, "=")
if !ok {
continue
}
if name == "bundle" {
bundle = value
} else {
userAnnotations[name] = value
}
}
return
}

View File

@ -1,363 +0,0 @@
//go:build !windows
package utils
import (
"fmt"
"math"
"os"
"path/filepath"
"runtime"
"strconv"
"strings"
"sync"
_ "unsafe" // for go:linkname
securejoin "github.com/cyphar/filepath-securejoin"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)
// EnsureProcHandle returns whether or not the given file handle is on procfs.
func EnsureProcHandle(fh *os.File) error {
var buf unix.Statfs_t
if err := unix.Fstatfs(int(fh.Fd()), &buf); err != nil {
return fmt.Errorf("ensure %s is on procfs: %w", fh.Name(), err)
}
if buf.Type != unix.PROC_SUPER_MAGIC {
return fmt.Errorf("%s is not on procfs", fh.Name())
}
return nil
}
var (
haveCloseRangeCloexecBool bool
haveCloseRangeCloexecOnce sync.Once
)
func haveCloseRangeCloexec() bool {
haveCloseRangeCloexecOnce.Do(func() {
// Make sure we're not closing a random file descriptor.
tmpFd, err := unix.FcntlInt(0, unix.F_DUPFD_CLOEXEC, 0)
if err != nil {
return
}
defer unix.Close(tmpFd)
err = unix.CloseRange(uint(tmpFd), uint(tmpFd), unix.CLOSE_RANGE_CLOEXEC)
// Any error means we cannot use close_range(CLOSE_RANGE_CLOEXEC).
// -ENOSYS and -EINVAL ultimately mean we don't have support, but any
// other potential error would imply that even the most basic close
// operation wouldn't work.
haveCloseRangeCloexecBool = err == nil
})
return haveCloseRangeCloexecBool
}
type fdFunc func(fd int)
// fdRangeFrom calls the passed fdFunc for each file descriptor that is open in
// the current process.
func fdRangeFrom(minFd int, fn fdFunc) error {
procSelfFd, closer := ProcThreadSelf("fd")
defer closer()
fdDir, err := os.Open(procSelfFd)
if err != nil {
return err
}
defer fdDir.Close()
if err := EnsureProcHandle(fdDir); err != nil {
return err
}
fdList, err := fdDir.Readdirnames(-1)
if err != nil {
return err
}
for _, fdStr := range fdList {
fd, err := strconv.Atoi(fdStr)
// Ignore non-numeric file names.
if err != nil {
continue
}
// Ignore descriptors lower than our specified minimum.
if fd < minFd {
continue
}
// Ignore the file descriptor we used for readdir, as it will be closed
// when we return.
if uintptr(fd) == fdDir.Fd() {
continue
}
// Run the closure.
fn(fd)
}
return nil
}
// CloseExecFrom sets the O_CLOEXEC flag on all file descriptors greater or
// equal to minFd in the current process.
func CloseExecFrom(minFd int) error {
// Use close_range(CLOSE_RANGE_CLOEXEC) if possible.
if haveCloseRangeCloexec() {
err := unix.CloseRange(uint(minFd), math.MaxUint, unix.CLOSE_RANGE_CLOEXEC)
return os.NewSyscallError("close_range", err)
}
// Otherwise, fall back to the standard loop.
return fdRangeFrom(minFd, unix.CloseOnExec)
}
//go:linkname runtime_IsPollDescriptor internal/poll.IsPollDescriptor
// In order to make sure we do not close the internal epoll descriptors the Go
// runtime uses, we need to ensure that we skip descriptors that match
// "internal/poll".IsPollDescriptor. Yes, this is a Go runtime internal thing,
// unfortunately there's no other way to be sure we're only keeping the file
// descriptors the Go runtime needs. Hopefully nothing blows up doing this...
func runtime_IsPollDescriptor(fd uintptr) bool //nolint:revive
// UnsafeCloseFrom closes all file descriptors greater or equal to minFd in the
// current process, except for those critical to Go's runtime (such as the
// netpoll management descriptors).
//
// NOTE: That this function is incredibly dangerous to use in most Go code, as
// closing file descriptors from underneath *os.File handles can lead to very
// bad behaviour (the closed file descriptor can be re-used and then any
// *os.File operations would apply to the wrong file). This function is only
// intended to be called from the last stage of runc init.
func UnsafeCloseFrom(minFd int) error {
// We cannot use close_range(2) even if it is available, because we must
// not close some file descriptors.
return fdRangeFrom(minFd, func(fd int) {
if runtime_IsPollDescriptor(uintptr(fd)) {
// These are the Go runtimes internal netpoll file descriptors.
// These file descriptors are operated on deep in the Go scheduler,
// and closing those files from underneath Go can result in panics.
// There is no issue with keeping them because they are not
// executable and are not useful to an attacker anyway. Also we
// don't have any choice.
return
}
// There's nothing we can do about errors from close(2), and the
// only likely error to be seen is EBADF which indicates the fd was
// already closed (in which case, we got what we wanted).
_ = unix.Close(fd)
})
}
// NewSockPair returns a new SOCK_STREAM unix socket pair.
func NewSockPair(name string) (parent, child *os.File, err error) {
fds, err := unix.Socketpair(unix.AF_LOCAL, unix.SOCK_STREAM|unix.SOCK_CLOEXEC, 0)
if err != nil {
return nil, nil, err
}
return os.NewFile(uintptr(fds[1]), name+"-p"), os.NewFile(uintptr(fds[0]), name+"-c"), nil
}
// WithProcfd runs the passed closure with a procfd path (/proc/self/fd/...)
// corresponding to the unsafePath resolved within the root. Before passing the
// fd, this path is verified to have been inside the root -- so operating on it
// through the passed fdpath should be safe. Do not access this path through
// the original path strings, and do not attempt to use the pathname outside of
// the passed closure (the file handle will be freed once the closure returns).
func WithProcfd(root, unsafePath string, fn func(procfd string) error) error {
// Remove the root then forcefully resolve inside the root.
unsafePath = stripRoot(root, unsafePath)
path, err := securejoin.SecureJoin(root, unsafePath)
if err != nil {
return fmt.Errorf("resolving path inside rootfs failed: %w", err)
}
procSelfFd, closer := ProcThreadSelf("fd/")
defer closer()
// Open the target path.
fh, err := os.OpenFile(path, unix.O_PATH|unix.O_CLOEXEC, 0)
if err != nil {
return fmt.Errorf("open o_path procfd: %w", err)
}
defer fh.Close()
procfd := filepath.Join(procSelfFd, strconv.Itoa(int(fh.Fd())))
// Double-check the path is the one we expected.
if realpath, err := os.Readlink(procfd); err != nil {
return fmt.Errorf("procfd verification failed: %w", err)
} else if realpath != path {
return fmt.Errorf("possibly malicious path detected -- refusing to operate on %s", realpath)
}
return fn(procfd)
}
type ProcThreadSelfCloser func()
var (
haveProcThreadSelf bool
haveProcThreadSelfOnce sync.Once
)
// ProcThreadSelf returns a string that is equivalent to
// /proc/thread-self/<subpath>, with a graceful fallback on older kernels where
// /proc/thread-self doesn't exist. This method DOES NOT use SecureJoin,
// meaning that the passed string needs to be trusted. The caller _must_ call
// the returned procThreadSelfCloser function (which is runtime.UnlockOSThread)
// *only once* after it has finished using the returned path string.
func ProcThreadSelf(subpath string) (string, ProcThreadSelfCloser) {
haveProcThreadSelfOnce.Do(func() {
if _, err := os.Stat("/proc/thread-self/"); err == nil {
haveProcThreadSelf = true
} else {
logrus.Debugf("cannot stat /proc/thread-self (%v), falling back to /proc/self/task/<tid>", err)
}
})
// We need to lock our thread until the caller is done with the path string
// because any non-atomic operation on the path (such as opening a file,
// then reading it) could be interrupted by the Go runtime where the
// underlying thread is swapped out and the original thread is killed,
// resulting in pull-your-hair-out-hard-to-debug issues in the caller. In
// addition, the pre-3.17 fallback makes everything non-atomic because the
// same thing could happen between unix.Gettid() and the path operations.
//
// In theory, we don't need to lock in the atomic user case when using
// /proc/thread-self/, but it's better to be safe than sorry (and there are
// only one or two truly atomic users of /proc/thread-self/).
runtime.LockOSThread()
threadSelf := "/proc/thread-self/"
if !haveProcThreadSelf {
// Pre-3.17 kernels did not have /proc/thread-self, so do it manually.
threadSelf = "/proc/self/task/" + strconv.Itoa(unix.Gettid()) + "/"
if _, err := os.Stat(threadSelf); err != nil {
// Unfortunately, this code is called from rootfs_linux.go where we
// are running inside the pid namespace of the container but /proc
// is the host's procfs. Unfortunately there is no real way to get
// the correct tid to use here (the kernel age means we cannot do
// things like set up a private fsopen("proc") -- even scanning
// NSpid in all of the tasks in /proc/self/task/*/status requires
// Linux 4.1).
//
// So, we just have to assume that /proc/self is acceptable in this
// one specific case.
if os.Getpid() == 1 {
logrus.Debugf("/proc/thread-self (tid=%d) cannot be emulated inside the initial container setup -- using /proc/self instead: %v", unix.Gettid(), err)
} else {
// This should never happen, but the fallback should work in most cases...
logrus.Warnf("/proc/thread-self could not be emulated for pid=%d (tid=%d) -- using more buggy /proc/self fallback instead: %v", os.Getpid(), unix.Gettid(), err)
}
threadSelf = "/proc/self/"
}
}
return threadSelf + subpath, runtime.UnlockOSThread
}
// ProcThreadSelfFd is small wrapper around ProcThreadSelf to make it easier to
// create a /proc/thread-self handle for given file descriptor.
//
// It is basically equivalent to ProcThreadSelf(fmt.Sprintf("fd/%d", fd)), but
// without using fmt.Sprintf to avoid unneeded overhead.
func ProcThreadSelfFd(fd uintptr) (string, ProcThreadSelfCloser) {
return ProcThreadSelf("fd/" + strconv.FormatUint(uint64(fd), 10))
}
// IsLexicallyInRoot is shorthand for strings.HasPrefix(path+"/", root+"/"),
// but properly handling the case where path or root are "/".
//
// NOTE: The return value only make sense if the path doesn't contain "..".
func IsLexicallyInRoot(root, path string) bool {
if root != "/" {
root += "/"
}
if path != "/" {
path += "/"
}
return strings.HasPrefix(path, root)
}
// MkdirAllInRootOpen attempts to make
//
// path, _ := securejoin.SecureJoin(root, unsafePath)
// os.MkdirAll(path, mode)
// os.Open(path)
//
// safer against attacks where components in the path are changed between
// SecureJoin returning and MkdirAll (or Open) being called. In particular, we
// try to detect any symlink components in the path while we are doing the
// MkdirAll.
//
// NOTE: Unlike os.MkdirAll, mode is not Go's os.FileMode, it is the unix mode
// (the suid/sgid/sticky bits are not the same as for os.FileMode).
//
// NOTE: If unsafePath is a subpath of root, we assume that you have already
// called SecureJoin and so we use the provided path verbatim without resolving
// any symlinks (this is done in a way that avoids symlink-exchange races).
// This means that the path also must not contain ".." elements, otherwise an
// error will occur.
//
// This uses securejoin.MkdirAllHandle under the hood, but it has special
// handling if unsafePath has already been scoped within the rootfs (this is
// needed for a lot of runc callers and fixing this would require reworking a
// lot of path logic).
func MkdirAllInRootOpen(root, unsafePath string, mode uint32) (_ *os.File, Err error) {
// If the path is already "within" the root, get the path relative to the
// root and use that as the unsafe path. This is necessary because a lot of
// MkdirAllInRootOpen callers have already done SecureJoin, and refactoring
// all of them to stop using these SecureJoin'd paths would require a fair
// amount of work.
// TODO(cyphar): Do the refactor to libpathrs once it's ready.
if IsLexicallyInRoot(root, unsafePath) {
subPath, err := filepath.Rel(root, unsafePath)
if err != nil {
return nil, err
}
unsafePath = subPath
}
// Check for any silly mode bits.
if mode&^0o7777 != 0 {
return nil, fmt.Errorf("tried to include non-mode bits in MkdirAll mode: 0o%.3o", mode)
}
// Linux (and thus os.MkdirAll) silently ignores the suid and sgid bits if
// passed. While it would make sense to return an error in that case (since
// the user has asked for a mode that won't be applied), for compatibility
// reasons we have to ignore these bits.
if ignoredBits := mode &^ 0o1777; ignoredBits != 0 {
logrus.Warnf("MkdirAll called with no-op mode bits that are ignored by Linux: 0o%.3o", ignoredBits)
mode &= 0o1777
}
rootDir, err := os.OpenFile(root, unix.O_DIRECTORY|unix.O_CLOEXEC, 0)
if err != nil {
return nil, fmt.Errorf("open root handle: %w", err)
}
defer rootDir.Close()
return securejoin.MkdirAllHandle(rootDir, unsafePath, int(mode))
}
// MkdirAllInRoot is a wrapper around MkdirAllInRootOpen which closes the
// returned handle, for callers that don't need to use it.
func MkdirAllInRoot(root, unsafePath string, mode uint32) error {
f, err := MkdirAllInRootOpen(root, unsafePath, mode)
if err == nil {
_ = f.Close()
}
return err
}
// Openat is a Go-friendly openat(2) wrapper.
func Openat(dir *os.File, path string, flags int, mode uint32) (*os.File, error) {
dirFd := unix.AT_FDCWD
if dir != nil {
dirFd = int(dir.Fd())
}
flags |= unix.O_CLOEXEC
fd, err := unix.Openat(dirFd, path, flags, mode)
if err != nil {
return nil, &os.PathError{Op: "openat", Path: path, Err: err}
}
return os.NewFile(uintptr(fd), dir.Name()+"/"+path), nil
}