This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

Commit

Merge pull request #938 from okartau/creation-size-fixes
Namespace creation size-related fixes
pohly authored Jun 16, 2021
2 parents 5538e08 + 7ae5a42 commit 58653fc
Showing 7 changed files with 106 additions and 96 deletions.
11 changes: 8 additions & 3 deletions Dockerfile
@@ -11,11 +11,14 @@ ARG GO_VERSION="1.16.1"
# run instead of just using some older, cached result.
ARG CACHEBUST

# We want a newer ndctl than the one available in buster:
RUN echo 'deb http://ftp.debian.org/debian buster-backports main' > /etc/apt/sources.list.d/buster-backports.list
RUN echo 'deb-src http://ftp.debian.org/debian buster-backports main' >> /etc/apt/sources.list.d/buster-backports.list
# In contrast to the runtime image below, here we can afford to install additional
# tools and recommended packages. But this image gets pushed to a registry by the CI as a cache,
# so it still makes sense to keep this layer small by removing /var/cache.
RUN ${APT_GET} update && \
${APT_GET} install -y gcc libndctl-dev make git curl iproute2 pkg-config xfsprogs e2fsprogs parted openssh-client python3 python3-venv equivs && \
${APT_GET} install -y gcc libndctl-dev/buster-backports make git curl iproute2 pkg-config xfsprogs e2fsprogs parted openssh-client python3 python3-venv equivs && \
rm -rf /var/cache/*
RUN curl -L https://dl.google.com/go/go${GO_VERSION}.linux-amd64.tar.gz | tar -zxf - -C / && \
mkdir -p /usr/local/bin/ && \
@@ -43,10 +46,12 @@ COPY --from=build python3_100.0_all.deb /var/cache/python3_100.0_all.deb
# lvm2 - volume management
# ndctl - pulls in the necessary library, useful by itself
# fio - only included in testing images
RUN echo 'deb http://ftp.debian.org/debian buster-backports main' > /etc/apt/sources.list.d/buster-backports.list
RUN echo 'deb-src http://ftp.debian.org/debian buster-backports main' >> /etc/apt/sources.list.d/buster-backports.list
RUN ${APT_GET} update && \
mkdir -p /usr/local/share && \
dpkg -i /var/cache/python3_100.0_all.deb && \
bash -c 'set -o pipefail; ${APT_GET} install -y --no-install-recommends file xfsprogs e2fsprogs lvm2 ndctl \
bash -c 'set -o pipefail; ${APT_GET} install -y --no-install-recommends file xfsprogs e2fsprogs lvm2 libndctl-dev/buster-backports ndctl/buster-backports \
| tee --append /usr/local/share/package-install.log' && \
rm -rf /var/cache/*

@@ -110,7 +115,7 @@ RUN cd /usr/local/share/package-sources && \
else \
echo " $pkg"; \
fi; \
done | sort -u; \
done && \
rm -rf /var/cache/*

# build pmem-csi-driver
34 changes: 33 additions & 1 deletion docs/design.md
@@ -9,9 +9,10 @@
- [Communication between components](#communication-between-components)
- [Security](#security)
- [Volume Persistency](#volume-persistency)
- [Volume Size](#volume-size)
- [Capacity-aware pod scheduling](#capacity-aware-pod-scheduling)
- [PMEM-CSI operator](#pmem-csi-operator)

## Architecture and Operation

The PMEM-CSI driver can operate in two different device modes: *LVM* and
@@ -292,6 +293,37 @@ PMEM-CSI because they use the normal volume provisioning process.

See [exposing persistent and cache volumes](install.md#expose-persistent-and-cache-volumes-to-applications) for configuration information.

## Volume Size

The size of a volume reflects how much of the underlying storage that
is managed by PMEM-CSI is required for the volume. That size is also
what needs to be specified when requesting a volume.

For LVM, the number of blocks taken away from a volume group is the
same as the number of blocks in the new logical volume. For direct
mode, there is [some additional
overhead](https://docs.pmem.io/ndctl-user-guide/managing-namespaces#fsdax-and-devdax-capacity-considerations). PMEM-CSI
stores the additional metadata on the PMEM device (`--map=dev` in
ndctl) because that way, volumes can be used without affecting the
available DRAM on a node. The size of a namespace as listed by ndctl
refers to the usable size in the block device for the namespace, which
is less than the amount of PMEM reserved for the namespace in the
region and thus also less than the requested volume size.

In both modes, the filesystem created on the block device introduces
further overhead. The overhead for the filesystem and the additional
metadata in direct mode is something that users must consider when
deploying applications.

*Note*: Applications can request to map a file into memory that is too
large for the filesystem. Attempts to actually *use* all of the mapped
file will then lead to page faults once all available storage is
exhausted. Applications should use `fallocate` to ensure that this
cannot happen. See [the memcached example
YAML](/deploy/kustomize/memcached/persistent/memcached-persistent.yaml)
for a way to deal with this for applications that do not use
`fallocate` themselves.

## Capacity-aware pod scheduling

PMEM-CSI implements the CSI `GetCapacity` call, but Kubernetes
12 changes: 6 additions & 6 deletions pkg/ndctl/fake/region.go
@@ -25,6 +25,7 @@ type Region struct {
Enabled_ bool
Readonly_ bool
InterleaveWays_ uint64
RegionAlign_ uint64

Mappings_ []ndctl.Mapping
Namespaces_ []ndctl.Namespace
@@ -105,8 +106,11 @@ func (r *Region) SeedNamespace() ndctl.Namespace {
return nil
}

func (r *Region) GetAlign() uint64 {
return r.RegionAlign_
}

func (r *Region) CreateNamespace(opts ndctl.CreateNamespaceOpts) (ndctl.Namespace, error) {
defaultAlign := mib2
var err error
/* Set defaults */
if opts.Type == "" {
@@ -150,11 +154,7 @@ func (r *Region) CreateNamespace(opts ndctl.CreateNamespaceOpts) (ndctl.Namespac
if opts.Size > available {
return nil, fmt.Errorf("create namespace with size %v: %w", opts.Size, pmemerr.NotEnoughSpace)
}
}
opts.Align = defaultAlign

if opts.Size != 0 {
align := opts.Align
align := mib2
if opts.Size%align != 0 {
// Round up size to align with next block boundary.
opts.Size = (opts.Size/align + 1) * align
1 change: 0 additions & 1 deletion pkg/ndctl/ndctl.go
@@ -29,7 +29,6 @@ type CreateNamespaceOpts struct {
Name string
Size uint64
SectorSize uint64
Align uint64
Type NamespaceType
Mode NamespaceMode
Location MapLocation
72 changes: 41 additions & 31 deletions pkg/ndctl/region.go
@@ -65,6 +65,8 @@ type Region interface {
AdaptAlign(align uint64) (uint64, error)
// FsdaxAlignment returns the default alignment for an fsdax namespace.
FsdaxAlignment() (uint64, error)
// GetAlign returns region alignment.
GetAlign() uint64
}

type region = C.struct_ndctl_region
@@ -143,8 +145,11 @@ func (r *region) SeedNamespace() Namespace {
return C.ndctl_region_get_namespace_seed(r)
}

func (r *region) GetAlign() uint64 {
return uint64(C.ndctl_region_get_align(r))
}

func (r *region) CreateNamespace(opts CreateNamespaceOpts) (Namespace, error) {
defaultAlign := mib2
var err error
/* Set defaults */
if opts.Type == "" {
@@ -184,38 +189,26 @@ func (r *region) CreateNamespace(opts CreateNamespaceOpts) (Namespace, error) {
}
}

if opts.Size != 0 {
available := r.MaxAvailableExtent()
if available == uint64(C.ULLONG_MAX) {
available = r.AvailableSize()
}
if opts.Size > available {
return nil, fmt.Errorf("create namespace with size %v: %w", opts.Size, pmemerr.NotEnoughSpace)
}
align := mib2
namespacealign := align * r.InterleaveWays()
regionalign := r.GetAlign()
// Size has to be aligned both by namespace alignment times interleave_ways, and also by region alignment
lcmalign := LCM(namespacealign, regionalign)
klog.V(3).Infof("%s: Least Common Multiple of namespacealign:%d and regionalign:%d is %d",
regionName, namespacealign, regionalign, lcmalign)
if opts.Size == 0 || opts.Size%lcmalign != 0 {
// Align up to least-common-multiple alignment boundary.
alignedsize := (opts.Size/lcmalign + 1) * lcmalign
klog.V(3).Infof("%s: namespace size must align to LCM alignment:%d, adjust up from %d to %d",
regionName, lcmalign, opts.Size, alignedsize)
opts.Size = alignedsize
}

align := defaultAlign
if opts.Align != 0 {
if opts.Mode == SectorMode || opts.Mode == RawMode {
klog.V(4).Infof("%s mode does not support setting an alignment, hence ignoring alignment", opts.Mode)
} else {
var err error
align, err = r.AdaptAlign(opts.Align)
if err != nil {
return nil, err
}
}
available := r.MaxAvailableExtent()
if available == uint64(C.ULLONG_MAX) {
available = r.AvailableSize()
}

if opts.Size != 0 {
ways := uint64(C.ndctl_region_get_interleave_ways(r))
align = align * ways
if opts.Size%align != 0 {
// Round up size to align with next block boundary.
opts.Size = (opts.Size/align + 1) * align
klog.V(4).Infof("%s: namespace size must align to interleave-width:%d * alignment:%d, force-align to %d",
regionName, ways, align, opts.Size)
}
if opts.Size > available {
return nil, fmt.Errorf("create namespace with size %v: %w", opts.Size, pmemerr.NotEnoughSpace)
}

/* setup_namespace */
@@ -273,6 +266,7 @@ func (r *region) CreateNamespace(opts CreateNamespaceOpts) (Namespace, error) {
return nil, err
}

klog.V(3).Infof("%s: Namespace created: size:%d uuid:%v", regionName, ns.Size(), ns.UUID())
return ns, nil
}

@@ -379,3 +373,19 @@ func (r *region) namespaces(onlyActive bool) []Namespace {

return namespaces
}

// Functions GCD and LCM borrowed from Go playground, simplified for 2 arguments.
// greatest common divisor (GCD) via Euclidean algorithm
func GCD(a, b uint64) uint64 {
for b != 0 {
t := b
b = a % b
a = t
}
return a
}

// find Least Common Multiple (LCM) via GCD
func LCM(a, b uint64) uint64 {
return a * b / GCD(a, b)
}
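
The size rounding that `CreateNamespace` performs in the hunk above can be tried out in isolation. The following is a standalone sketch, not the driver code: the lowercase helpers and the example values (2 MiB namespace alignment, 6-way interleave, 16 MiB region alignment) are assumptions chosen for illustration, and `lcm` divides before multiplying to keep the intermediate value small.

```go
package main

import "fmt"

// gcd computes the greatest common divisor with the Euclidean algorithm.
func gcd(a, b uint64) uint64 {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

// lcm computes the least common multiple; dividing first avoids a large
// intermediate product. Both arguments must be non-zero.
func lcm(a, b uint64) uint64 {
	return a / gcd(a, b) * b
}

// alignUp follows the rule used in the diff: a size of zero, or a size that
// is not a multiple of align, moves up to the next alignment boundary.
func alignUp(size, align uint64) uint64 {
	if size == 0 || size%align != 0 {
		return (size/align + 1) * align
	}
	return size
}

func main() {
	const mib = 1 << 20
	// Assumed example values, not read from real hardware.
	namespaceAlign := uint64(2*mib) * 6 // 2 MiB alignment times 6 interleave ways
	regionAlign := uint64(16 * mib)
	align := lcm(namespaceAlign, regionAlign)
	// Prints the combined alignment and the rounded size in MiB.
	fmt.Println(align/mib, alignUp(100*mib, align)/mib)
}
```

With these assumed values the combined alignment is 48 MiB, so a 100 MiB request is rounded up to 144 MiB.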
39 changes: 9 additions & 30 deletions pkg/pmem-device-manager/pmd-lvm.go
@@ -309,34 +309,19 @@ func getVolumeGroups(groups []string) ([]vgInfo, error) {
return vgs, nil
}

const (
GB uint64 = 1024 * 1024 * 1024
)

// setupNS checks if a namespace needs to be created in the region and if so, does that.
func setupNS(r ndctl.Region, percentage uint) error {
align := GB
// In doc for "ndctl create-namespace" https://pmem.io/ndctl/ndctl-create-namespace.html
// it is stated that:
// For pmem namepsaces the size must be a multiple of the interleave-width and the namespace alignment.
// Because "align" is already used for argument we pass into r.CreateNamespace,
// we use "realalign" for multiplied alignment value required by above requirement.
realalign := align * r.InterleaveWays()
canUse := uint64(percentage) * r.Size() / 100
klog.V(3).Infof("Create fsdax-namespaces in %v, allowed %d %%, real align %d:\ntotal : %16d\navail : %16d\ncan use : %16d",
r.DeviceName(), percentage, realalign, r.Size(), r.AvailableSize(), canUse)
klog.V(3).Infof("Create fsdax-namespaces in %v, allowed %d %%\ntotal : %16d\navail : %16d\ncan use : %16d",
r.DeviceName(), percentage, r.Size(), r.AvailableSize(), canUse)
// Subtract sizes of existing active namespaces with currently handled mode and owned by pmem-csi
for _, ns := range r.ActiveNamespaces() {
klog.V(5).Infof("setupNS: Exists: Size %16d Mode:%v Device:%v Name:%v", ns.Size(), ns.Mode(), ns.DeviceName(), ns.Name())
if ns.Name() != pmemCSINamespaceName {
continue
}
diff := int64(canUse - ns.Size())
if diff <= 0 {
canUse = 0
} else {
canUse = uint64(diff)
}
klog.V(5).Infof("setupNS: Found owned-by-self namespace of size:%d, stop processing this region", ns.Size())
return nil
}
klog.V(4).Infof("Calculated canUse:%v, available by Region info:%v", canUse, r.AvailableSize())
// Because of overhead by alignment and extra space for page mapping, calculated available may show more than actual
@@ -349,18 +334,12 @@ func setupNS(r ndctl.Region, percentage uint) error {
klog.V(4).Infof("MaxAvailableExtent in Region:%v is less than desired size, limit to that", r.MaxAvailableExtent())
canUse = r.MaxAvailableExtent()
}
// Align down to next real alignment boundary, as trying creation above it may fail.
canUse /= realalign
canUse *= realalign
// If less than 2GB usable, don't attempt as creation would fail
minsize := 2 * GB
if canUse >= minsize {
klog.V(3).Infof("Create %v-bytes fsdax-namespace", canUse)
if canUse > 0 {
klog.V(3).Infof("Create fsdax-namespace with size:%d", canUse)
_, err := r.CreateNamespace(ndctl.CreateNamespaceOpts{
Name: "pmem-csi",
Mode: "fsdax",
Size: canUse,
Align: align,
Name: "pmem-csi",
Mode: "fsdax",
Size: canUse,
})
if err != nil {
return fmt.Errorf("failed to create PMEM namespace with size '%d' in region '%s': %v", canUse, r.DeviceName(), err)
33 changes: 9 additions & 24 deletions pkg/pmem-device-manager/pmd-ndctl.go
@@ -13,12 +13,6 @@ import (
"k8s.io/utils/mount"
)

const (
// 1 GB align in ndctl creation request has proven to be reliable.
// Newer kernels may allow smaller alignment but we do not want to introduce kernel dependency.
ndctlAlign uint64 = 1024 * 1024 * 1024
)

type pmemNdctl struct {
pmemPercentage uint
}
@@ -120,13 +114,13 @@ func (pmem *pmemNdctl) GetCapacity() (capacity Capacity, err error) {
continue
}

realalign := ndctlAlign * r.InterleaveWays()
regionalign := r.GetAlign()
available := r.MaxAvailableExtent()
// align down, avoid claiming more than what we really can serve
klog.V(4).Infof("GetCapacity: available before realalign: %d", available)
available /= realalign
available *= realalign
klog.V(4).Infof("GetCapacity: available after realalign: %d", available)
klog.V(4).Infof("GetCapacity: initial available value by Region state: %d", available)
// align down by regionalign, avoid claiming having more than what we really can serve
available /= regionalign
available *= regionalign
klog.V(4).Infof("GetCapacity: available after regionalign down: %d", available)
if available > capacity.MaxVolumeSize {
capacity.MaxVolumeSize = available
}
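
The align-down step in the `GetCapacity` hunk above relies on integer division truncating toward zero. A small sketch with assumed example values (a 1000 MiB free extent and a 16 MiB region alignment); the helper name `alignDown` is hypothetical and does not exist in the driver:

```go
package main

import "fmt"

// alignDown rounds size down to a multiple of align by dividing and then
// multiplying again. align must be non-zero, otherwise the division panics.
func alignDown(size, align uint64) uint64 {
	return size / align * align
}

func main() {
	const mib = 1 << 20
	// Assumed example values, not read from real hardware.
	fmt.Println(alignDown(1000*mib, 16*mib) / mib)
}
```

With these values the reported capacity drops from 1000 MiB to 992 MiB, the largest multiple of 16 MiB that still fits.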
@@ -158,19 +152,10 @@ func (pmem *pmemNdctl) CreateDevice(volumeId string, size uint64) error {
return pmemerr.DeviceExists
}

// libndctl needs to store meta data and will use some of the allocated
// space for that (https://github.com/pmem/ndctl/issues/79).
// We don't know exactly how much space that is, just
// that it should be a small amount. But because libndctl
// rounds up to the alignment, in practice that means we need
// to request `align` additional bytes.
size += ndctlAlign
klog.V(4).Infof("Compensate for libndctl creating one alignment step smaller: increase size to %d", size)
ns, err := ndctl.CreateNamespace(ndctx, ndctl.CreateNamespaceOpts{
Name: volumeId,
Size: size,
Align: ndctlAlign,
Mode: "fsdax",
Name: volumeId,
Size: size,
Mode: "fsdax",
})
if err != nil {
return err
