Commit 986cf86e authored by Kirill Smelkov

wcfs: client: Provide virtmem integration

Provide integration with virtmem, so that a WCFS Mapping can be associated
with, and managed under, a virtmem VMA. In other words, provide support so that
WCFS can be used as ZBigFile backend in "mmap overlay" mode (see fae045cc
"bigfile/virtmem: Introduce "mmap overlay" mode" for description of mmap-overlay mode).

We'll need this functionality for ZBigFile + WCFS client integration.

Virtmem integration will be tested via running whole wendelin.core functional
testsuite in wcfs-mode after the next patch.

Quoting added description:

---- 8< ----

Integration with wendelin.core virtmem layer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This client package can be used standalone, but additionally provides
integration with wendelin.core userspace virtual memory manager: when a
Mapping is created, it can be associated as serving base layer for a
particular virtmem VMA via FileH.mmap(vma=...). In that case, since virtmem
itself adds another layer of dirty pages over read-only base provided by
Mapping(+)

                 ┌──┐                      ┌──┐
                 │RW│                      │RW│    ← virtmem VMA dirty pages
                 └──┘                      └──┘
                           +
                                                   VMA base = X@at view provided by Mapping:

                                          ___        /@revA/bigfile/X
        __                                           /@revB/bigfile/X
               _                                     /@revC/bigfile/X
                           +                         ...
     ───  ───── ──────────────────────────   ─────   /head/bigfile/X

the Mapping will interact with virtmem layer to coordinate
updates to mapping virtual memory.
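
As a rough illustration of the association described above, the sketch below links a mapping and a VMA to each other and checks that both cover the same address range. `VMASketch`, `MappingSketch`, `associate`, `linked_ok` and `dissociate` are hypothetical names used only for this example. The real logic lives in `_FileH::mmap`, `_Mapping::_assertVMAOk` and `_Mapping::unmap`, and panics on violation instead of returning false.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for virtmem VMA and wcfs Mapping (assumptions for
// illustration; the real structures carry many more fields).
struct VMASketch {
    uintptr_t addr_start = 0;                 // VMA covers [addr_start, addr_stop)
    uintptr_t addr_stop  = 0;
    void     *mmap_overlay_server = nullptr;  // -> mapping serving the base layer
};

struct MappingSketch {
    uint8_t   *mem_start = nullptr;           // mmapped memory [mem_start, mem_stop)
    uint8_t   *mem_stop  = nullptr;
    VMASketch *vma       = nullptr;           // nullptr if created standalone
};

// linked_ok reports whether m and its vma point to each other and cover
// exactly the same virtual memory range.
bool linked_ok(const MappingSketch &m) {
    const VMASketch *vma = m.vma;
    if (vma == nullptr)
        return false;
    return vma->mmap_overlay_server == static_cast<const void *>(&m)
        && vma->addr_start == reinterpret_cast<uintptr_t>(m.mem_start)
        && vma->addr_stop  == reinterpret_cast<uintptr_t>(m.mem_stop);
}

// associate makes m the base-layer server of vma.
// It refuses a vma that is already served or that already covers memory.
bool associate(MappingSketch &m, VMASketch &vma) {
    if (vma.mmap_overlay_server != nullptr)
        return false;
    if (!(vma.addr_start == 0 && vma.addr_stop == 0))
        return false;
    m.vma = &vma;
    vma.mmap_overlay_server = &m;
    vma.addr_start = reinterpret_cast<uintptr_t>(m.mem_start);
    vma.addr_stop  = reinterpret_cast<uintptr_t>(m.mem_stop);
    return linked_ok(m);
}

// dissociate resets the link in both directions, as is done on unmap.
void dissociate(MappingSketch &m) {
    if (m.vma == nullptr)
        return;
    m.vma->mmap_overlay_server = nullptr;
    m.vma->addr_start = 0;
    m.vma->addr_stop  = 0;
    m.vma = nullptr;
}
```

The sketch leaves out reference counting: in the patch the VMA holds a reference to the mapping for as long as the link exists.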

How it works
~~~~~~~~~~~~

Wcfs client integrates with virtmem layer so that virtmem can handle
dirtying pages of the read-only base layer that wcfs client provides via
isolated Mapping. For wcfs-backed bigfiles every virtmem VMA is interlinked
with a Mapping:

      VMA     -> BigFileH -> ZBigFile -----> Z
       ↑↓                                    O
     Mapping  -> FileH    -> wcfs server --> DB

When a page is write-accessed, virtmem mmaps in a page of RAM in place of
accessed virtual memory, copies base-layer content provided by Mapping into
there, and marks that page as read-write.
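
The write-access step can be pictured with plain POSIX calls. The following is a minimal sketch, not virtmem's actual code: `make_ro_base` and `overlay_rw_page` are hypothetical helpers, the base layer is anonymous memory rather than a wcfs file mapping, and the RAM page is a fresh anonymous page rather than one taken from virtmem's RAM pool.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// make_ro_base creates npages of read-only memory filled with `fill`,
// standing in for the base layer that a wcfs Mapping would provide.
// Returns nullptr on failure. (Hypothetical helper.)
uint8_t *make_ro_base(size_t npages, size_t pagesize, uint8_t fill) {
    size_t size = npages * pagesize;
    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return nullptr;
    memset(p, fill, size);
    if (mprotect(p, size, PROT_READ) != 0) {
        munmap(p, size);
        return nullptr;
    }
    return static_cast<uint8_t *>(p);
}

// overlay_rw_page replaces one read-only base page at `page` with a private
// read-write RAM page that starts out with the same content.
// Returns false if the RAM page could not be mapped. (Hypothetical helper.)
bool overlay_rw_page(uint8_t *page, size_t pagesize) {
    // 1. save base-layer content before the base page is replaced
    std::vector<uint8_t> saved(page, page + pagesize);

    // 2. mmap a page of RAM in place of the accessed virtual memory
    void *p = mmap(page, pagesize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        return false;

    // 3. copy base-layer content into the new page, which is now read-write
    memcpy(page, saved.data(), pagesize);
    return true;
}
```

Only the overlaid page changes protection. Neighbouring pages keep showing the read-only base, which is why the base mapping does not have to be split.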

Upon receiving a pin message, the pinner asks virtmem whether the
corresponding page was already dirtied in virtmem's BigFileH (call to
__fileh_page_isdirty). If it was, the pinner does not remmap that part of
the Mapping to wcfs/@revX/f and just leaves the dirty page in place,
remembering pin information in fileh._pinned.
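
The pinner's decision can be modelled without any real memory mapping. The sketch below is an assumption-laden simplification: `FileHSketch` and `handle_pin` are hypothetical, the `dirty` set stands in for `__fileh_page_isdirty`, `mapped` stands in for what `_remmapblk` would make the base layer show, and the value chosen for `TidHead` is only a stand-in. Dropping the `_pinned` entry on a pin to head is also an assumption about bookkeeping that the text does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

using Tid = uint64_t;
const Tid TidHead = ~Tid(0);   // stand-in value meaning "unpin to head"

// FileHSketch models only the state the pin decision needs (hypothetical).
struct FileHSketch {
    std::map<int64_t, Tid> pinned;  // blk -> rev, in the spirit of fileh._pinned
    std::set<int64_t>      dirty;   // blocks whose page virtmem already dirtied
    std::map<int64_t, Tid> mapped;  // blk -> rev the base mapping currently shows
};

// handle_pin processes one pin request for (blk, at).
// It returns true if the base mapping was remmapped, and false if a dirty
// page was left in place.
bool handle_pin(FileHSketch &f, int64_t blk, Tid at) {
    // remmap only if virtmem has not already dirtied the page for this block
    bool do_pin = (f.dirty.count(blk) == 0);
    if (do_pin)
        f.mapped[blk] = at;         // stands for mmap->_remmapblk(blk, at)

    // remember pin information in both cases, so that a later remmap of a
    // previously dirty block can pick the right revision
    if (at == TidHead)
        f.pinned.erase(blk);
    else
        f.pinned[blk] = at;
    return do_pin;
}
```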

Once dirty pages are no longer needed (either after discard/abort or
writeout/commit), virtmem asks wcfs client to remmap corresponding regions
of Mapping in its place again via calls to Mapping.remmap_blk for previously
dirtied blocks.
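
When a dirty page goes away, the block has to show whatever revision the pinner recorded in the meantime, or head if nothing was recorded. A minimal sketch of that lookup follows, with hypothetical helper names and a stand-in value for `TidHead`. In the patch this logic sits inside `_Mapping::remmap_blk`, which also takes locks and handles efaulted mappings.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using Tid = uint64_t;
const Tid TidHead = ~Tid(0);   // stand-in value meaning "view head/"

// blk_in_range mirrors the check that blk belongs to the mapping, which
// covers blocks [blk_start, blk_stop). (Hypothetical helper.)
bool blk_in_range(int64_t blk_start, int64_t blk_stop, int64_t blk) {
    return blk_start <= blk && blk < blk_stop;
}

// rev_to_remmap picks the revision a block must show once its dirty page is
// gone: the pinned revision if there is one, else head. (Hypothetical helper.)
Tid rev_to_remmap(const std::map<int64_t, Tid> &pinned, int64_t blk) {
    auto it = pinned.find(blk);
    return (it != pinned.end()) ? it->second : TidHead;
}
```

For example, with `pinned = {3: revA}`, block 3 is remmapped to the `@revA` view while block 4 goes back to head.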

The scheme outlined above does not need to split Mapping upon dirtying an
inner page.

See bigfile_ops interface (wendelin/bigfile/file.h) that explains base-layer
and overlaying from virtmem point of view. For wcfs this interface is
provided by small wcfs client wrapper in bigfile/file_zodb.cpp.

(+) see bigfile_ops interface (wendelin/bigfile/file.h) that gives virtmem
    point of view on layering.

----------------------------------------

Some preliminary history:

kirr/wendelin.core@f330bd2f    X wcfs/client: Overview += interaction with virtmem layer
parent e11edc70
......@@ -60,8 +60,7 @@ static int __ram_reclaim(RAM *ram);
/* global lock which protects manipulating virtmem data structures
*
* NOTE not scalable, but this is temporary solution - as we are going to move
* memory management back into the kernel, where it is done properly. */
* NOTE not scalable. */
static pthread_mutex_t virtmem_lock = PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP;
static const VirtGilHooks *virtmem_gilhooks;
......
......@@ -94,7 +94,7 @@ struct bigfile_ops {
* dirtied pages that are laid over base data layer provided by the
* mappings.
*
* The primary user of this functionality will be wcfs - virtual filesystem that
* The primary user of this functionality is wcfs - virtual filesystem that
* provides access to ZBigFile data via OS-level files(*). The layering can
* be schematically depicted as follows
*
......
......@@ -39,7 +39,7 @@
* are dirtied. The mode in which BigFile handle is opened is specified via
* fileh_open(flags=...).
*
* The primary user of "mmap overlay" functionality will be wcfs - virtual
* The primary user of "mmap overlay" functionality is wcfs - virtual
* filesystem that provides access to ZBigFile data via OS-level files(*).
*
* (*) see wcfs/client/wcfs.h and wcfs/wcfs.go
......@@ -171,7 +171,7 @@ struct VMA {
* MMAP_OVERLAY flag. bigfile_ops.mmap_setup_read can initialize this to
* object pointer specific to serving created base overlay mapping.
*
* For example WCFS will use this to link VMA -> wcfs.Mapping to know which
* For example WCFS uses this to link VMA -> wcfs.Mapping to know which
* wcfs-specific mapping is serving particular virtmem VMA.
*
* NULL for VMAs created from under DONT_MMAP_OVERLAY fileh. */
......
......@@ -318,7 +318,8 @@ setup(
['wcfs/client/wcfs.cpp',
'wcfs/client/wcfs_watchlink.cpp',
'wcfs/client/wcfs_misc.cpp'],
depends = libwcfs_h)],
depends = libvirtmem_h + libwcfs_h,
dsos = ['wendelin.bigfile.libvirtmem'])],
ext_modules = [
PyGoExt('wendelin.bigfile._bigfile',
......@@ -333,14 +334,14 @@ setup(
PyGoExt('wendelin.wcfs.client._wcfs',
['wcfs/client/_wcfs.pyx'],
depends = libwcfs_h,
depends = libwcfs_h + libvirtmem_h,
dsos = ['wendelin.wcfs.client.libwcfs']),
PyGoExt('wendelin.wcfs.client._wczsync',
['wcfs/client/_wczsync.pyx'],
depends = [
'wcfs/client/_wcfs.pxd',
] + libwcfs_h,
] + libwcfs_h + libvirtmem_h,
dsos = ['wendelin.wcfs.client.libwcfs']),
PyGoExt('wendelin.wcfs.internal.wcfs_test',
......
......@@ -17,7 +17,7 @@
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package wcfs provides WCFS client.
// Package wcfs provides WCFS client integrated with user-space virtual memory manager.
// See wcfs.h for package overview.
......@@ -56,8 +56,9 @@
// points like Conn.open, Conn.resync and FileH.close.
//
// Every FileH maintains fileh._pinned {} with currently pinned blk -> rev. This
// dict is updated by pinner driven by pin messages, and is used when
// new fileh Mapping is created (FileH.mmap).
// dict is updated by pinner driven by pin messages, and is used when either
// new fileh Mapping is created (FileH.mmap) or refreshed due to request from
// virtmem (Mapping.remmap_blk, see below).
//
// In wendelin.core a bigfile has semantic that it is infinite in size and
// reads as all zeros beyond region initialized with data. Memory-mapping of
......@@ -70,6 +71,40 @@
// wcfs/head/f at every transaction boundary (Conn.resync) and remembers f.size
// in FileH._headfsize for use during one transaction(%).
//
//
// Integration with wendelin.core virtmem layer
//
// Wcfs client integrates with virtmem layer so that virtmem can handle
// dirtying pages of the read-only base layer that wcfs client provides via
// isolated Mapping. For wcfs-backed bigfiles every virtmem VMA is interlinked
// with a Mapping:
//
// VMA -> BigFileH -> ZBigFile -----> Z
// ↑↓ O
// Mapping -> FileH -> wcfs server --> DB
//
// When a page is write-accessed, virtmem mmaps in a page of RAM in place of
// accessed virtual memory, copies base-layer content provided by Mapping into
// there, and marks that page as read-write.
//
// Upon receiving a pin message, the pinner asks virtmem whether the
// corresponding page was already dirtied in virtmem's BigFileH (call to
// __fileh_page_isdirty). If it was, the pinner does not remmap that part of
// the Mapping to wcfs/@revX/f and just leaves the dirty page in place,
// remembering pin information in fileh._pinned.
//
// Once dirty pages are no longer needed (either after discard/abort or
// writeout/commit), virtmem asks wcfs client to remmap corresponding regions
// of Mapping in its place again via calls to Mapping.remmap_blk for previously
// dirtied blocks.
//
// The scheme outlined above does not need to split Mapping upon dirtying an
// inner page.
//
// See bigfile_ops interface (wendelin/bigfile/file.h) that explains base-layer
// and overlaying from virtmem point of view. For wcfs this interface is
// provided by small wcfs client wrapper in bigfile/file_zodb.cpp.
//
// --------
//
// (*) see wcfs.go documentation for WCFS isolation protocol overview and details.
......@@ -126,18 +161,27 @@
// Note that FileH.mmapMu is regular - not RW - mutex, since nothing in wcfs
// client calls into wcfs server via watchlink with mmapMu held.
//
// To synchronize with virtmem layer, wcfs client takes and releases big
// virtmem lock around places that touch virtmem (calls to virt_lock and
// virt_unlock). Also virtmem calls several wcfs client entrypoints with
// virtmem lock already taken. Thus, to avoid AB-BA style deadlocks, wcfs
// client needs to take virtmem lock as the first lock, whenever it needs to
// take both virtmem lock, and another lock - e.g. atMu(%).
//
// The ordering of locks is:
//
// Conn.atMu > Conn.filehMu > FileH.mmapMu
// virt_lock > Conn.atMu > Conn.filehMu > FileH.mmapMu
//
// The pinner takes the following locks:
//
// - virt_lock
// - wconn.atMu.R
// - wconn.filehMu.R
// - fileh.mmapMu (to read .mmaps + write .pinned)
//
//
// (*) see "Wcfs locking organization" in wcfs.go
// (%) see related comment in Conn.__pin1 for details.
// Handling of fork
......@@ -164,6 +208,9 @@
#include "wcfs.h"
#include "wcfs_watchlink.h"
#include <wendelin/bigfile/virtmem.h>
#include <wendelin/bigfile/ram.h>
#include <golang/errors.h>
#include <golang/fmt.h>
#include <golang/io.h>
......@@ -195,10 +242,6 @@ namespace ioutil = io::ioutil;
// trace with op prefix taken from E.
#define etrace(format, ...) trace("%s", v(E(fmt::errorf(format, ##__VA_ARGS__))))
#define ASSERT(expr) do { \
if (!(expr)) \
panic("assert failed: " #expr); \
} while(0)
// wcfs::
namespace wcfs {
......@@ -301,6 +344,15 @@ error _Conn::close() {
// NOTE keep in sync with Conn.afterFork
_Conn& wconn = *this;
// lock virtmem early. TODO more granular virtmem locking (see __pin1 for
// details and why virt_lock currently goes first)
virt_lock();
bool virtUnlocked = false;
defer([&]() {
if (!virtUnlocked)
virt_unlock();
});
wconn._atMu.RLock();
defer([&]() {
wconn._atMu.RUnlock();
......@@ -356,6 +408,7 @@ error _Conn::close() {
}
// force fileh close.
// - virt_lock
// - wconn.atMu.R
// - wconn.filehMu unlocked
err = f->_closeLocked(/*force=*/true);
......@@ -369,6 +422,10 @@ error _Conn::close() {
}
// close wlink and signal to pinner to stop.
// we have to release virt_lock, to avoid deadlocking with pinner.
virtUnlocked = true;
virt_unlock();
err = wconn._wlink->close();
if (err != nil)
reterr1(err);
......@@ -501,6 +558,31 @@ error _Conn::__pin1(PinReq *req) {
FileH f;
bool ok;
// lock virtmem first.
//
// The reason we do it here instead of closely around call to
// mmap->_remmapblk() is to avoid deadlocks: virtmem calls FileH.mmap,
// Mapping.remmap_blk and Mapping.unmap under virt_lock locked. In those
// functions the order of locks is
//
// virt_lock, wconn.atMu.R, fileh.mmapMu
//
// So if we take virt_lock right around mmap._remmapblk(), the order of
// locks in pinner would be
//
// wconn.atMu.R, wconn.filehMu.R, fileh.mmapMu, virt_lock
//
// which means there is AB-BA deadlock possibility.
//
// TODO try to take virt_lock only around virtmem-associated VMAs and with
// better granularity. NOTE it is possible to teach virtmem to call
// FileH.mmap and Mapping.unmap without virtmem locked. However reworking
// virtmem to call Mapping.remmap_blk without virt_lock is not so easy.
virt_lock();
defer([&]() {
virt_unlock();
});
wconn._atMu.RLock();
defer([&]() {
wconn._atMu.RUnlock();
......@@ -541,7 +623,27 @@ error _Conn::__pin1(PinReq *req) {
trace("\tremmapblk %d @%s", req->blk, (req->at == TidHead ? "head" : v(req->at)));
error err = mmap->_remmapblk(req->blk, req->at);
// pin only if virtmem has not already dirtied the page corresponding to this block.
// if virtmem dirtied the page - it will ask us to remmap it again after commit or abort.
bool do_pin= true;
error err;
if (mmap->vma != nil) {
mmap->_assertVMAOk();
// see ^^^ about deadlock
//virt_lock();
BigFileH *virt_fileh = mmap->vma->fileh;
TODO (mmap->fileh->blksize != virt_fileh->ramh->ram->pagesize);
do_pin = !__fileh_page_isdirty(virt_fileh, req->blk);
}
if (do_pin)
err = mmap->_remmapblk(req->blk, req->at);
// see ^^^ about deadlock
//if (mmap->vma != nil)
// virt_unlock();
// on error don't need to continue with other mappings - all fileh and
// all mappings become marked invalid on pinner failure.
......@@ -894,6 +996,13 @@ error _FileH::close() {
_FileH& fileh = *this;
Conn wconn = fileh.wconn;
// lock virtmem early. TODO more granular virtmem locking (see __pin1 for
// details and why virt_lock currently goes first)
virt_lock();
defer([&]() {
virt_unlock();
});
wconn->_atMu.RLock();
defer([&]() {
wconn->_atMu.RUnlock();
......@@ -905,6 +1014,7 @@ error _FileH::close() {
// _closeLocked serves FileH.close and Conn.close.
//
// Must be called with the following locks held by caller:
// - virt_lock
// - wconn.atMu
error _FileH::_closeLocked(bool force) {
// NOTE keep in sync with FileH._afterFork
......@@ -1017,9 +1127,13 @@ void _FileH::_afterFork() {
}
// mmap creates file mapping representing file[blk_start +blk_len) data as of wconn.at database state.
pair<Mapping, error> _FileH::mmap(int64_t blk_start, int64_t blk_len) {
//
// If vma != nil, created mapping is associated with that vma of user-space virtual memory manager:
// virtmem calls FileH::mmap under virtmem lock when virtmem fileh is mmapped into vma.
pair<Mapping, error> _FileH::mmap(int64_t blk_start, int64_t blk_len, VMA *vma) {
_FileH& f = *this;
// NOTE virtmem lock is held by virtmem caller
f.wconn->_atMu.RLock(); // e.g. f._headfsize
f.wconn->_filehMu.RLock(); // f._state TODO -> finer grained (currently too coarse)
f._mmapMu.lock(); // f._pinned, f._mmaps
......@@ -1080,6 +1194,7 @@ pair<Mapping, error> _FileH::mmap(int64_t blk_start, int64_t blk_len) {
mmap->blk_start = blk_start;
mmap->mem_start = mem_start;
mmap->mem_stop = mem_stop;
mmap->vma = vma;
mmap->efaulted = false;
for (auto _ : f._pinned) { // TODO keep f._pinned ↑blk and use binary search
......@@ -1092,6 +1207,18 @@ pair<Mapping, error> _FileH::mmap(int64_t blk_start, int64_t blk_len) {
return make_pair(nil, E(err));
}
if (vma != nil) {
if (vma->mmap_overlay_server != nil)
panic("vma is already associated with overlay server");
if (!(vma->addr_start == 0 && vma->addr_stop == 0))
panic("vma already covers !nil virtual memory area");
mmap->incref(); // vma->mmap_overlay_server is keeping ref to mmap
vma->mmap_overlay_server = mmap._ptr();
vma->addr_start = (uintptr_t)mmap->mem_start;
vma->addr_stop = (uintptr_t)mmap->mem_stop;
mmap->_assertVMAOk(); // just in case
}
f._mmaps.push_back(mmap); // TODO keep f._mmaps ↑blk_start
retok = true;
......@@ -1105,6 +1232,7 @@ pair<Mapping, error> _FileH::mmap(int64_t blk_start, int64_t blk_len) {
// correct f@at data view.
//
// Must be called with the following locks held by caller:
// - virt_lock
// - fileh.mmapMu
error _Mapping::__remmapAsEfault() {
_Mapping& mmap = *this;
......@@ -1138,10 +1266,14 @@ error _Mapping::__remmapBlkAsEfault(int64_t blk) {
// unmap releases mapping memory from address space.
//
// After call to unmap the mapping must no longer be used.
// The association between mapping and linked virtmem VMA is reset.
//
// Virtmem calls Mapping.unmap under virtmem lock when VMA is unmapped.
error _Mapping::unmap() {
Mapping mmap = newref(this); // newref for std::remove
FileH f = mmap->fileh;
// NOTE virtmem lock is held by virtmem caller
f->wconn->_atMu.RLock();
f->_mmapMu.lock();
defer([&]() {
......@@ -1156,6 +1288,16 @@ error _Mapping::unmap() {
if (mmap->mem_start == nil)
return nil;
if (mmap->vma != nil) {
mmap->_assertVMAOk();
VMA *vma = mmap->vma;
vma->mmap_overlay_server = nil;
mmap->decref(); // vma->mmap_overlay_server was holding a ref to mmap
vma->addr_start = 0;
vma->addr_stop = 0;
mmap->vma = nil;
}
error err = mm::unmap(mmap->mem_start, mmap->mem_stop - mmap->mem_start);
mmap->mem_start = nil;
mmap->mem_stop = nil;
......@@ -1171,6 +1313,7 @@ error _Mapping::unmap() {
// _remmapblk remmaps mapping memory for file[blk] to be viewing database as of @at state.
//
// at=TidHead means unpin to head/ .
// NOTE this does not check whether virtmem already mapped blk as RW.
//
// _remmapblk must not be called after Mapping is switched to efault.
//
......@@ -1232,6 +1375,50 @@ error _Mapping::_remmapblk(int64_t blk, zodb::Tid at) {
return nil;
}
// remmap_blk remmaps file[blk] in its place again.
//
// Virtmem calls Mapping.remmap_blk under virtmem lock to remmap a block after
// RW dirty page was e.g. discarded or committed.
error _Mapping::remmap_blk(int64_t blk) {
_Mapping& mmap = *this;
FileH f = mmap.fileh;
error err;
// NOTE virtmem lock is held by virtmem caller
f->wconn->_atMu.RLock();
f->_mmapMu.lock();
defer([&]() {
f->_mmapMu.unlock();
f->wconn->_atMu.RUnlock();
});
xerr::Contextf E("%s: %s: %s: remmapblk #%ld", v(f->wconn), v(f), v(mmap), blk);
etrace("");
if (!(mmap.blk_start <= blk && blk < mmap.blk_stop()))
panic("remmap_blk: blk out of Mapping range");
// it should not happen, but if virtmem asks us to remmap base-layer blk
// memory in its place again for an efaulted mapping, we reinject efault into it.
if (mmap.efaulted) {
log::Warnf("%s: remmapblk called for already-efaulted mapping", v(mmap));
return E(mmap.__remmapBlkAsEfault(blk));
}
// blkrev = rev | @head
zodb::Tid blkrev; bool ok;
tie(blkrev, ok) = f->_pinned.get_(blk);
if (!ok)
blkrev = TidHead;
err = mmap._remmapblk(blk, blkrev);
if (err != nil)
return E(err);
return nil;
}
// ---- WCFS raw file access ----
// _path returns path for object on wcfs.
......@@ -1301,6 +1488,25 @@ static error mmap_into_ro(void *addr, size_t size, os::File f, off_t offset) {
}
// _assertVMAOk() verifies that mmap and vma are related to each other and cover
// exactly the same virtual memory range.
//
// It panics if mmap and vma do not exactly relate to each other or cover
// different virtual memory range.
void _Mapping::_assertVMAOk() {
_Mapping* mmap = this;
VMA *vma = mmap->vma;
if (!(vma->mmap_overlay_server == static_cast<void*>(mmap)))
panic("BUG: mmap and vma do not link to each other");
if (!(vma->addr_start == uintptr_t(mmap->mem_start) &&
vma->addr_stop == uintptr_t(mmap->mem_stop)))
panic("BUG: mmap and vma cover different virtual memory ranges");
// verified ok
}
string WCFS::String() const {
const WCFS& wc = *this;
return fmt::sprintf("wcfs %s", v(wc.mountpoint));
......
......@@ -17,7 +17,7 @@
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package wcfs provides WCFS client.
// Package wcfs provides WCFS client integrated with user-space virtual memory manager.
//
// This client package takes care of WCFS isolation protocol details and
// provides clients with a simple interface to isolated view of bigfile data on
......@@ -46,6 +46,31 @@
// to maintain X@at data view according to WCFS isolation protocol(*).
//
//
// Integration with wendelin.core virtmem layer
//
// This client package can be used standalone, but additionally provides
// integration with wendelin.core userspace virtual memory manager: when a
// Mapping is created, it can be associated as serving base layer for a
// particular virtmem VMA via FileH.mmap(vma=...). In that case, since virtmem
// itself adds another layer of dirty pages over read-only base provided by
// Mapping(+)
//
// ┌──┐ ┌──┐
// │RW│ │RW│ ← virtmem VMA dirty pages
// └──┘ └──┘
// +
// VMA base = X@at view provided by Mapping:
//
// ___ /@revA/bigfile/X
// __ /@revB/bigfile/X
// _ /@revC/bigfile/X
// + ...
// ─── ───── ────────────────────────── ───── /head/bigfile/X
//
// the Mapping will interact with virtmem layer to coordinate
// updates to mapping virtual memory.
//
//
// API overview
//
// - `WCFS` represents filesystem-level connection to wcfs server.
......@@ -67,6 +92,8 @@
// --------
//
// (*) see wcfs.go documentation for WCFS isolation protocol overview and details.
// (+) see bigfile_ops interface (wendelin/bigfile/file.h) that gives virtmem
// point of view on layering.
#ifndef _NXD_WCFS_H_
#define _NXD_WCFS_H_
......@@ -79,6 +106,12 @@
#include <utility>
#include "wcfs_misc.h"
#include <wendelin/bug.h>
// from wendelin/bigfile/virtmem.h
extern "C" {
struct VMA;
}
// wcfs::
......@@ -230,7 +263,7 @@ public:
public:
error close();
pair<Mapping, error> mmap(int64_t blk_start, int64_t blk_len);
pair<Mapping, error> mmap(int64_t blk_start, int64_t blk_len, VMA *vma=nil);
string String() const;
error _open();
......@@ -252,16 +285,18 @@ struct _Mapping : object {
// protected by fileh._mmapMu
uint8_t *mem_start; // mmapped memory [mem_start, mem_stop)
uint8_t *mem_stop;
VMA *vma; // mmapped under this virtmem VMA | nil if created standalone from virtmem
bool efaulted; // y after mapping was switched to be invalid (gives SIGSEGV on access)
int64_t blk_stop() const {
if (!((mem_stop - mem_start) % fileh->blksize == 0))
panic("len(mmap) % fileh.blksize != 0");
ASSERT((mem_stop - mem_start) % fileh->blksize == 0);
return blk_start + (mem_stop - mem_start) / fileh->blksize;
}
error remmap_blk(int64_t blk); // for virtmem-only
error unmap();
void _assertVMAOk();
error _remmapblk(int64_t blk, zodb::Tid at);
error __remmapAsEfault();
error __remmapBlkAsEfault(int64_t blk);
......@@ -270,7 +305,7 @@ struct _Mapping : object {
private:
_Mapping();
~_Mapping();
friend pair<Mapping, error> _FileH::mmap(int64_t blk_start, int64_t blk_len);
friend pair<Mapping, error> _FileH::mmap(int64_t blk_start, int64_t blk_len, VMA *vma);
public:
void decref();
......