pyrfs: Architecture
A Pythonic port of R's fs · Status: design draft · Last updated: 2026-06-11
Companion: pyrfs-ux.md (user-facing design)
1. Purpose & non-goals
Purpose. Give Python the same file-system ergonomics that R users enjoy from fs:
consistent noun_verb naming families, tidy paths, predictable path-carrying return values,
explicit failure, and typed self-describing values (human-readable sizes, rwxr-xr-x
permissions) — while being chainable/pipeable and integrating natively with pandas.
What pyrfs is. A thin, ergonomic, fully-typed wrapper over the Python standard library
(pathlib, shutil, os, stat, pwd/grp) plus an optional pandas integration layer.
Non-goals.
- Not a new filesystem abstraction over remote/cloud backends (that's fsspec/PyFilesystem2).
- Not a C/native extension. R's fs needed libuv for cross-platform syscalls; Python's
stdlib already abstracts that, so pyrfs is pure Python — no build step, trivial install.
- Not a 1:1 transliteration. We keep fs's UX contract, expressed in idiomatic Python.
2. Core principle: one engine, three surfaces
Every filesystem operation is implemented once in a pure-stdlib _engine. The three
user-facing surfaces are thin delegations — no logic is duplicated across them.
flowchart TD
subgraph surfaces["User-facing surfaces"]
fn["Functional API<br/>file_copy(a, b)<br/>dir_ls(p) · path_ext(p)"]
fp["Fluent FsPath<br/>FsPath(a).copy_to(b)<br/>(FsPath(p) / 'x').with_ext('md')"]
acc["pandas .fs accessor<br/>df['path'].fs.ext()<br/>dir_info(p) -> DataFrame"]
end
eng["pyrfs._engine<br/>(pure stdlib, no pandas)<br/>paths · fileops · dirops · linkops · ids · temp"]
std[("Python stdlib<br/>pathlib · shutil · os · stat · pwd/grp")]
fn --> eng
fp --> eng
acc --> eng
eng --> std
Why this matters: fs itself uses this idea — high-level R verbs compose from a small set of
C primitives. pyrfs applies it in pure Python: the fluent object and the pandas accessor are
presentation layers, and correctness lives in one place.
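The delegation idea can be sketched in a few lines. This is a minimal illustration, not the actual pyrfs source: `_engine_path_ext` is a hypothetical name for an engine primitive, and the two surfaces shown here only forward to it.

```python
import os.path


def _engine_path_ext(path: str) -> str:
    """Engine primitive (hypothetical name): file extension without the dot."""
    return os.path.splitext(path)[1].lstrip(".")


def path_ext(path: str) -> str:
    """Functional surface: a thin delegation to the engine."""
    return _engine_path_ext(path)


class FsPath(str):
    """Fluent surface: the method forwards to the same engine primitive."""

    def ext(self) -> str:
        return _engine_path_ext(self)


print(path_ext("report.tar.gz"))      # gz
print(FsPath("report.tar.gz").ext())  # gz
```

Because both surfaces call the same function, a fix to the extension logic lands in one place and both entry points pick it up.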
3. System context
flowchart LR
user([Python user / data scientist])
subgraph pyrfs["pyrfs"]
core["core API + FsPath + typed values"]
pdx["optional pandas layer"]
end
pandas{{"pandas (optional extra)"}}
std[("OS filesystem via stdlib")]
user -->|"file_*/dir_*/path_* · FsPath · Series.fs"| pyrfs
core --> std
core -.->|"lazily, if installed"| pdx
pdx --> pandas
- Inbound: scripts, notebooks, and packages call pyrfs.
- Hard dependency: none beyond the standard library (Python ≥ 3.10).
- Optional: pandas. It enables *_info DataFrames, the .fs Series accessor, and the ExtensionDtypes. Absent pandas, the core still works and *_info returns list[dict].
4. Package layout (flat layout)
The importable package sits at the top level (pyrfs/pyrfs/), not under src/.
pyrfs/ # repo root
├── pyproject.toml # setuptools backend, [project], optional-deps, tooling
├── docs/ # these design docs
├── pyrfs/ # the importable package
│ ├── __init__.py # PUBLIC re-exports (functions + FsPath/Bytes/Perms + FsError)
│ ├── py.typed # PEP 561 marker (ships type info)
│ ├── errors.py # FsError hierarchy (validation)
│ ├── fspath.py # FsPath(str) — fluent, chainable [PUBLIC]
│ ├── values.py # Bytes(int), Perms(int) — typed scalars [PUBLIC]
│ ├── display.py # humanize bytes · perms→rwx · LS_COLORS · tidy
│ ├── _engine/ # pure-stdlib core (NEVER imports pandas)
│ │ ├── paths.py # path_* algebra
│ │ ├── fileops.py # file_*
│ │ ├── dirops.py # dir_* (ls/map/walk/info/tree/create/copy/delete)
│ │ ├── linkops.py # link_*
│ │ ├── ids.py # user_ids/group_ids
│ │ ├── temp.py # file_temp stack · path_temp
│ │ └── vectorize.py # polymorphic scalar|iterable dispatch
│ └── _pandas/ # OPTIONAL integration (imported only if pandas present)
│ ├── __init__.py # registers .fs accessor + ExtensionDtypes
│ ├── dtypes.py # BytesDtype, PermsDtype, PathDtype
│ ├── arrays.py # BytesArray, PermsArray, PathArray
│ ├── accessor.py # @register_series_accessor("fs")
│ └── frames.py # build *_info DataFrames with typed columns
└── tests/ # pytest mirror of the package
Module responsibilities
| Module | Responsibility | Depends on |
|---|---|---|
| _engine/paths.py | Pure path string algebra (path, path_dir, path_ext*, path_rel, path_norm, …) | pathlib, os.path |
| _engine/fileops.py | file_create/copy/move/delete/touch/show/chmod/chown/info/size/access | shutil, os, stat |
| _engine/dirops.py | dir_create/copy/delete/ls/map/walk/info/tree, recursion & filtering | os.scandir, pathlib |
| _engine/linkops.py | link_create/copy/delete/exists/path | os |
| _engine/ids.py | user_ids/group_ids (POSIX; empty frames on Windows) | pwd, grp |
| _engine/temp.py | file_temp deterministic stack, path_temp | tempfile |
| _engine/vectorize.py | Decorator mapping scalar funcs over iterables/Series | (none) |
| fspath.py | FsPath(str) fluent object; methods delegate to _engine | _engine, display |
| values.py | Bytes(int), Perms(int) typed scalars | display |
| display.py | Formatting/parsing: humanize_bytes, parse_bytes, perms_to_str, parse_perms, tidy, LS_COLORS | stdlib |
| _pandas/* | ExtensionDtypes/arrays, .fs accessor, DataFrame builders | pandas, reuses display/values |
Invariant: _engine and values/display must never import pandas. The optional layer
depends inward on them, never the reverse — a classic dependency-inversion boundary.
5. The three surfaces in detail
5.1 Functional API (R-fs faithful)
Mirrors fs's families and names exactly: path_* (pure, no I/O), file_*, dir_*, link_*,
predicates (is_file, is_dir, is_link, …), user_ids/group_ids, temp helpers.
- Predictable returns: verbs return FsPath (or a list/Series of them); predicates return bool or a vectorized mapping; file_size → Bytes; *_info → DataFrame (or list[dict]).
- Safe defaults ported verbatim: overwrite=False, recurse defaults matching fs (False for listing, True for dir_create), all=False, fail=True.
- recurse: bool | int overload: True/False/depth, exactly like fs.
5.2 Fluent FsPath
FsPath subclasses str — the same choice as R's fs_path ⊂ character and the path
library. Because an FsPath is a string, it drops into any stdlib or third-party API that
expects a path, and serializes cleanly into pandas.
classDiagram
class str {
<<builtin>>
}
class FsPath {
+__truediv__(other) FsPath
+ext() str
+with_ext(ext) FsPath
+dir() FsPath
+name() FsPath
+abs() FsPath
+real() FsPath
+exists() bool
+is_dir() bool
+copy_to(dst) FsPath
+move_to(dst) FsPath
+touch() FsPath
+delete() None
+mkdir(recurse) FsPath
+ls(...) list~FsPath~
+info() DataFrame
}
str <|-- FsPath
FsPath ..> _engine : delegates
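Path-producing methods return FsPath (or lists thereof) so calls chain: (FsPath("a") / "b").with_ext("txt").copy_to("c"). Methods that answer a question or end a chain return plain values instead: ext() → str, exists() → bool, delete() → None, info() → DataFrame.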
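A stripped-down sketch of the str-subclass approach follows. It is illustrative only: the method bodies are assumptions built on posixpath, and the real FsPath delegates to _engine and applies tidy rules not shown here.

```python
import os.path
import posixpath


class FsPath(str):
    """Minimal stand-in for FsPath: a str subclass with chainable methods."""

    def __truediv__(self, other: str) -> "FsPath":
        # `/` joins path segments and keeps the FsPath type so calls chain.
        return FsPath(posixpath.join(self, str(other)))

    def with_ext(self, ext: str) -> "FsPath":
        stem = posixpath.splitext(self)[0]
        return FsPath(f"{stem}.{ext.lstrip('.')}")

    def dir(self) -> "FsPath":
        return FsPath(posixpath.dirname(self))


p = (FsPath("docs") / "notes.txt").with_ext("md")
print(p)                    # docs/notes.md
print(isinstance(p, str))   # True
print(os.path.basename(p))  # notes.md
print(type(p.dir()).__name__)  # FsPath
```

The os.path.basename call shows the payoff of subclassing str: an FsPath can be handed to any API that expects a plain path string without conversion.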
5.3 pandas .fs accessor + DataFrame returns
- A registered Series accessor gives vectorized path algebra over a column: df["path"].fs.ext(), .dir(), .with_ext("md"), .exists(), .is_dir().
- dir_info()/file_info() return a DataFrame whose path/size/permissions columns use the ExtensionDtypes, so the R headline demo translates directly:
(dir_info("pyrfs", recurse=False)
.query("size > '10KB' and type == 'file'")
.sort_values("size", ascending=False))
6. Typed value system
Two cooperating tiers, sharing one set of parse/format functions in display.py.
flowchart TD
subgraph fmt["display.py — single source of truth"]
hb["humanize_bytes / parse_bytes"]
pp["perms_to_str / parse_perms"]
ti["tidy (path normalizer)"]
end
subgraph scalars["values.py + fspath.py (always available)"]
b["Bytes(int)"]
p["Perms(int)"]
fpath["FsPath(str)"]
end
subgraph arrays["_pandas/arrays.py (optional)"]
ba["BytesArray / BytesDtype"]
pa["PermsArray / PermsDtype"]
pta["PathArray / PathDtype"]
end
hb --> b --> ba
pp --> p --> pa
ti --> fpath --> pta
Scalar wrappers (pure stdlib, always present)
| Type | Subclass of | Construct from | Displays as | Overloads |
|---|---|---|---|---|
| Bytes | int | int, "10MB", "1.5GiB" | 445.2K | <, >, == parse string RHS; arithmetic returns Bytes |
| Perms | int | octal "644", symbolic "u+rw,go+r", int | rw-r--r-- | & \| ~ return Perms; == parses string RHS |
| FsPath | str | any path-like | tidy path (coloured in terminal) | / for join |
Subclassing the builtins mirrors fs's S3-over-atomic-vector design (fs_bytes ⊂ numeric,
fs_perms ⊂ integer, fs_path ⊂ character): a value still behaves like its base type but
remembers what it is and prints for humans.
pandas ExtensionArrays (optional)
For each scalar there is a real ExtensionArray/ExtensionDtype so DataFrame columns are
first-class typed:
- BytesDtype (name="bytes", backing int64): elements show 445.2K; native >/</== against strings inside .query(); sum/min/max reductions.
- PermsDtype (name="perms"): elements show rwxr-xr-x.
- PathDtype (name="path", backing object of FsPath): tidy display, <fs::path>-style repr.
Implemented with the standard protocol (_from_sequence, __getitem__, __len__, isna,
take, copy, _concat_same_type) plus ExtensionScalarOpsMixin for operators, registered via
@register_extension_dtype. They call the same display.py functions as the scalars — no
duplicated formatting logic.
7. Vectorization model
R's fs is vectorized end to end. Python is scalar-by-default; pyrfs bridges this with a small
@vectorized decorator in _engine/vectorize.py:
input type → output type
-------------------------------------
str | PathLike | FsPath → scalar (FsPath/Bytes/bool)
list | tuple | set → list
pandas.Series → pandas.Series (only if pandas importable)
This gives file_exists(["a", "b"]) → [bool, bool] and path_ext(series) → Series, while a
single path returns a single value. The .fs accessor is the idiomatic vectorized-over-column
surface; the decorator makes the bare functions polymorphic too.
flowchart LR
inp["caller input"] --> dec{"@vectorized<br/>dispatch on type"}
dec -->|scalar| s["f(x) -> scalar"]
dec -->|iterable| l["[f(x) for x] -> list"]
dec -->|Series| ser["x.map(f) -> Series"]
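The dispatch above could be sketched as follows. This is one possible shape for the decorator, not the pyrfs implementation: the `_is_series` helper and the TypeError for unsupported inputs are assumptions, and the Series check looks pandas up in sys.modules so the engine never imports it.

```python
import functools
import os
import sys
from typing import Any, Callable


def _is_series(x: Any) -> bool:
    # If the caller holds a Series, pandas is already in sys.modules,
    # so this check never triggers a pandas import itself.
    pandas = sys.modules.get("pandas")
    return pandas is not None and isinstance(x, pandas.Series)


def vectorized(func: Callable[..., Any]) -> Callable[..., Any]:
    """Let a scalar path function also accept lists, tuples, sets and Series."""

    @functools.wraps(func)
    def wrapper(x: Any, *args: Any, **kwargs: Any) -> Any:
        def one(item: Any) -> Any:
            return func(os.fspath(item), *args, **kwargs)

        if isinstance(x, (str, os.PathLike)):
            return one(x)                      # scalar in, scalar out
        if _is_series(x):
            return x.map(one)                  # Series in, Series out
        if isinstance(x, (list, tuple, set)):
            return [one(item) for item in x]   # iterable in, list out
        raise TypeError(f"unsupported input type: {type(x).__name__}")

    return wrapper


@vectorized
def path_ext(path: str) -> str:
    return os.path.splitext(path)[1].lstrip(".")


print(path_ext("a.txt"))            # txt
print(path_ext(["a.txt", "b.md"]))  # ['txt', 'md']
```

Checking str and PathLike first matters: a str is itself iterable, so testing for iterables before scalars would split a path into characters.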
8. Error model
fs's promise is explicit failure (throw, never a silent FALSE). Python's stdlib already
honors this — os/shutil/pathlib raise OSError subclasses. pyrfs's policy:
- Reuse native exceptions where they fit: FileNotFoundError, FileExistsError, PermissionError (all OSError). overwrite=False on an existing target → FileExistsError (matches fs).
- Add pyrfs.FsError(Exception) for pyrfs-level validation that has no native equivalent, e.g. glob and regexp both set, recycling length mismatch, bad permission/size literal. Subclasses (FsValueError, …) let callers except precisely, mirroring fs's classed fs_error/invalid_argument.
- fail=False softens directory traversals (dir_ls/dir_map/dir_info) from error to warning when a single entry is inaccessible, a direct port of fs's fail knob.
flowchart TD
op["pyrfs operation"] --> k{failure?}
k -->|"OS-level"| oserr["raise FileNotFoundError /<br/>FileExistsError / PermissionError"]
k -->|"bad argument"| fserr["raise pyrfs.FsError subclass"]
k -->|"traversal entry, fail=False"| warn["warnings.warn(), skip entry"]
k -->|none| ok["return typed value (FsPath/Bytes/bool/DataFrame)"]
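The three failure branches can be illustrated with a simplified traversal. This is a sketch under stated assumptions: the function below is a cut-down stand-in for dir_info that returns plain dicts, accepts glob/regexp only to validate them, and treats any OSError from stat (for example a dangling symlink) as an inaccessible entry.

```python
import os
import warnings


class FsError(Exception):
    """Base class for pyrfs-level validation errors."""


class FsValueError(FsError):
    """An argument combination or literal was invalid."""


def dir_info(path: str, *, glob=None, regexp=None, fail: bool = True) -> list[dict]:
    # Bad argument: no native exception fits, so raise an FsError subclass.
    if glob is not None and regexp is not None:
        raise FsValueError("pass glob or regexp, not both")
    rows = []
    # OS-level failure on the directory itself propagates as the native OSError.
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        try:
            size = os.stat(full).st_size
        except OSError as exc:
            if fail:
                raise
            # fail=False: downgrade a single bad entry to a warning and move on.
            warnings.warn(f"skipping {full}: {exc}", stacklevel=2)
            continue
        rows.append({"path": full, "size": size})
    return rows


try:
    dir_info(".", glob="*.py", regexp=r"\.py$")
except FsError as exc:
    print(type(exc).__name__, "-", exc)
```

Callers can therefore write `except FsError` for misuse of the API and `except OSError` for filesystem conditions without the two overlapping.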
9. Optional-dependency strategy
pandas is an extra (pip install pyrfs[pandas]). The mechanism:
1. _engine and values/display never import pandas → core is import-safe without it.
2. pyrfs/__init__.py attempts import pyrfs._pandas inside a try/except ImportError; success registers the .fs accessor and the ExtensionDtypes.
3. *_info functions check a cached has_pandas() flag: return a typed DataFrame when present, else a plain list[dict] (still useful, still typed scalars in each row).
This mirrors fs's R philosophy: hard deps minimal (Imports: methods), rich integrations as
Suggests (pillar, vctrs) wired up lazily in .onLoad.
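A small sketch of the cached-flag pattern follows. The `as_info` helper is a hypothetical stand-in for the tail end of a *_info function, and importing pandas directly stands in for importing pyrfs._pandas; the block runs whether or not pandas is installed.

```python
import functools


@functools.lru_cache(maxsize=None)
def has_pandas() -> bool:
    """Try the import once; later calls reuse the cached answer."""
    try:
        import pandas  # noqa: F401  (stand-in for `import pyrfs._pandas`)
    except ImportError:
        return False
    return True


def as_info(rows: list[dict]):
    """Return a DataFrame when pandas is available, else the plain rows."""
    if has_pandas():
        import pandas as pd  # deferred: only reached when the extra is present
        return pd.DataFrame(rows)
    return rows


result = as_info([{"path": "a.txt", "size": 12}])
print(type(result).__name__)  # DataFrame with pandas installed, list without
```

Keeping the pandas import inside the function body is what lets the module itself import cleanly on a stdlib-only install.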
10. Build & tooling
- Backend: setuptools ([build-system] requires = ["setuptools>=68"]).
- Layout: flat ([tool.setuptools.packages.find] where = ["."], include = ["pyrfs*"]).
- Env/locking: uv (uv sync, uv run …).
- Python: requires-python = ">=3.10".
- Extras: pandas = ["pandas>=2.0"], optional color, dev = ["pytest","ruff","mypy"].
- Quality gates: ruff (lint+format), mypy --strict (no Any, py.typed shipped), pytest (pandas tests guarded by importorskip, run with and without the extra).
- Docstrings: NumPy style on the public API.
11. Representative flow: file_copy("a.txt", dest_dir)
sequenceDiagram
participant U as caller
participant F as file_copy (functional API)
participant V as vectorize
participant E as _engine.fileops
participant S as shutil/os
participant D as display.tidy
U->>F: file_copy("a.txt", "out/")
F->>V: dispatch on input shape
V->>E: _copy_one("a.txt", "out/", overwrite=False)
E->>E: resolve dir target -> "out/a.txt"; check exists
alt exists and not overwrite
E-->>U: raise FileExistsError
else
E->>S: shutil.copy2("a.txt", "out/a.txt")
E->>D: tidy("out/a.txt")
D-->>F: FsPath("out/a.txt")
F-->>U: FsPath
end
The same _engine._copy_one backs FsPath.copy_to and any .fs-accessor copy — one engine,
three surfaces.
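The sequence above might look like this in code. It is a simplified sketch: vectorized dispatch is skipped, tidy is reduced to forcing forward slashes, and the FsPath class is an empty placeholder, all of which are assumptions made to keep the example short.

```python
import os
import shutil
import tempfile


class FsPath(str):
    """Placeholder for the real FsPath return type."""


def _copy_one(src: str, dst: str, overwrite: bool = False) -> FsPath:
    # A directory target means "copy into it, keeping the file name".
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    # Explicit failure instead of a silent overwrite.
    if os.path.exists(dst) and not overwrite:
        raise FileExistsError(f"{dst!r} exists; pass overwrite=True to replace it")
    shutil.copy2(src, dst)  # copies contents and metadata
    return FsPath(dst.replace(os.sep, "/"))  # stand-in for display.tidy


def file_copy(src, dst, overwrite: bool = False) -> FsPath:
    """Functional surface: normalize inputs, then delegate to the engine."""
    return _copy_one(os.fspath(src), os.fspath(dst), overwrite)


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "a.txt")
    out = os.path.join(root, "out")
    os.mkdir(out)
    with open(src, "w") as fh:
        fh.write("hello")
    print(os.path.basename(file_copy(src, out)))  # a.txt
    try:
        file_copy(src, out)  # second copy: target already exists
    except FileExistsError as exc:
        print("refused:", type(exc).__name__)
```

Note that the exists-then-copy check is not atomic; a production version would need to decide how much it cares about another process creating the target in between.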
12. Open questions & notes
- Path display colour. FsPath.__repr__ colouring via LS_COLORS is deferred to a late phase (P6); it must degrade cleanly on non-TTY / NO_COLOR. Default plan: plain until P6.
- ExtensionArray scope. Full operator/reduction coverage on BytesArray is the heaviest piece; v1 targets comparisons + sum/min/max. Edge cases (groupby aggregations, astype round-trips) to be pinned down with tests in P5.
- Windows specifics. user_ids/group_ids return empty frames (no pwd/grp); symlink creation may require privilege. Tidy paths always use /. To be verified on a Windows runner.
- path_expand semantics. fs distinguishes path_expand vs path_expand_r; pyrfs maps the former to os.path.expanduser and will document any divergence rather than hide it.
- dir_move. Like fs, pyrfs intentionally has no dir_move; directories move via file_move.