Architecture
HPCSeries Core v0.7 — Logical Architecture
Architectural Intent
HPCSeries Core v0.7 is a CPU-first, high-performance numeric kernel engine designed to provide:
Deterministic, cache-efficient time-series and array analytics
Robust statistics resilient to outliers and missing data
SIMD-vectorized and OpenMP-parallel execution paths
Stable C ABI with Fortran and C++ implementations
Python bindings for data-science workflows
The architecture is layered, composable, and backend-agnostic (GPU intentionally excluded from v1.x).
High-Level Layer Stack
┌─────────────────────────────────────────────┐
│ Python API (v0.7)                           │
│ - hpcs.sum, mean, std                       │
│ - rolling_*                                 │
│ - robust_*                                  │
│ - masked_*                                  │
│ - axis_*                                    │
└─────────────────────────────────────────────┘
                       │
┌─────────────────────────────────────────────┐
│ Stable C ABI (hpcs_core.h)                  │
│ - ISO_C_BINDING compliant                   │
│ - status-out error handling                 │
│ - versioned, frozen for v1.x                │
└─────────────────────────────────────────────┘
                       │
┌─────────────────────────────────────────────┐
│ C++ High-Performance Extensions             │
│ - Fast rolling median / MAD (O(n log w))    │
│ - STL-based heaps / multisets               │
│ - Used where asymptotics beat SIMD          │
└─────────────────────────────────────────────┘
                       │
┌─────────────────────────────────────────────┐
│ Fortran Numeric Kernel Core                 │
│ - SIMD inner loops (AVX2)                   │
│ - OpenMP parallel variants                  │
│ - Masked, robust, axis-aware kernels        │
└─────────────────────────────────────────────┘
                       │
┌─────────────────────────────────────────────┐
│ Foundational Utilities & Constants          │
│ - Status codes                              │
│ - NaN/Inf handling                          │
│ - Memory-safe helpers                       │
└─────────────────────────────────────────────┘
Fortran Kernel Subsystems
Foundational Layer
Purpose: Shared infrastructure used by all kernels.
Modules:
hpcs_constants
  Status codes: SUCCESS, INVALID_ARGS, NUMERIC_FAILURE
  Precision definitions
hpcs_core_utils
  fill, copy, where
  forward/backward fill
  min–max normalization
This layer has no dependencies except the Fortran runtime.
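As an illustration of two of the helpers listed above, here is a minimal pure-Python sketch of forward fill and min–max normalization. The function names, signatures, and edge-case choices (leading NaNs stay NaN, a constant input maps to zeros) are assumptions made for this example and are not taken from hpcs_core_utils.

```python
import math

def forward_fill(values):
    """Replace each NaN with the most recent non-NaN value.

    Leading NaNs have nothing to copy from, so they stay NaN (assumed policy).
    """
    out = []
    last = math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out

def min_max_normalize(values):
    """Scale values linearly onto [0, 1].

    A constant input has zero range; this sketch maps it to all zeros.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0.0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

filled = forward_fill([math.nan, 1.0, math.nan, math.nan, 4.0])
scaled = min_max_normalize([2.0, 4.0, 6.0])
```

Both helpers return new lists and leave their inputs untouched, which keeps them safe to reuse from higher layers.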
1D Time-Series Kernels
Purpose: Fast, cache-friendly analytics on single vectors.
Modules:
hpcs_core_1d
  rolling_sum, rolling_mean, rolling_variance, rolling_std
  zscore
hpcs_core_reductions
  reduce_sum / min / max / mean / variance / std
hpcs_core_prefix
  prefix_sum (inclusive / exclusive)
Characteristics:
Tight loops
SIMD-vectorized
Optional OpenMP variants
Deterministic behavior
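The tight-loop style of these kernels can be sketched in plain Python. The example below shows a sliding-window rolling sum that does O(1) work per element, plus inclusive and exclusive prefix sums. The names mirror the module list, but the signatures and the output convention (only full windows are returned) are assumptions for this sketch, not the library's documented behavior.

```python
def rolling_sum(values, window):
    """Sliding-window sum in O(n): add the entering value, subtract the leaving one."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be in 1..len(values)")
    out = []
    acc = 0.0
    for i, v in enumerate(values):
        acc += v
        if i >= window:
            acc -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            out.append(acc)            # emit only once the window is full
    return out

def prefix_sum(values, inclusive=True):
    """Running total; the exclusive form starts at 0 and omits the current element."""
    out = []
    acc = 0.0
    for v in values:
        if inclusive:
            acc += v
            out.append(acc)
        else:
            out.append(acc)
            acc += v
    return out

sums = rolling_sum([1.0, 2.0, 3.0, 4.0, 5.0], 3)
inclusive = prefix_sum([1.0, 2.0, 3.0, 4.0])
exclusive = prefix_sum([1.0, 2.0, 3.0, 4.0], inclusive=False)
```

The loop visits the data once, in order, which is the access pattern that vectorizes and caches well in a compiled kernel.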
Robust Statistics Layer
Purpose: Outlier-resilient analytics.
Modules:
hpcs_core_stats
  median (quickselect)
  MAD
  quantile
hpcs_core_rolling
  rolling_median
  rolling_mad
hpcs_core_quality
  robust_zscore
  winsorization
  clipping
Design principle: Robust stats build on basic reductions but never mutate core primitives.
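To make the median and MAD entries concrete, the sketch below implements a generic randomized quickselect with a three-way partition and builds median and MAD on top of it. It is an illustration of the technique only. The function names and the choice to work on a private copy are assumptions, and the library's actual kernel may differ.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    work = list(values)  # private copy, so the caller's data is never mutated
    lo, hi = 0, len(work) - 1
    while True:
        if lo == hi:
            return work[lo]
        pivot = work[random.randint(lo, hi)]
        # Three-way partition of work[lo..hi]: < pivot | == pivot | > pivot
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if work[i] < pivot:
                work[lt], work[i] = work[i], work[lt]
                lt += 1
                i += 1
            elif work[i] > pivot:
                work[i], work[gt] = work[gt], work[i]
                gt -= 1
            else:
                i += 1
        if k < lt:
            hi = lt - 1      # target lies in the "< pivot" part
        elif k > gt:
            lo = gt + 1      # target lies in the "> pivot" part
        else:
            return pivot     # target is one of the pivot-equal elements

def median(values):
    n = len(values)
    if n == 0:
        raise ValueError("median of empty input")
    if n % 2:
        return quickselect(values, n // 2)
    return 0.5 * (quickselect(values, n // 2 - 1) + quickselect(values, n // 2))

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = median(values)
    return median([abs(v - med) for v in values])
```

For example, `median([1, 2, 3, 4, 100])` is 3 and `mad([1, 2, 3, 4, 100])` is 1, so the single outlier does not move either statistic.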
Masked Operations Layer
Purpose: Correct analytics in the presence of missing or invalid data.
Modules:
hpcs_core_masked
  masked reductions
  masked rolling means
  masked robust stats
Key behavior:
Explicit validity mask
Propagates NaNs correctly
Returns numeric failure when no valid data exists
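The key behaviors above can be sketched as a masked mean that takes an explicit validity mask and reports a status alongside the result. The status names come from hpcs_constants, but their numeric values, the function signature, and the exact NaN policy (a NaN marked valid propagates into the result) are assumptions for this example.

```python
import math

# Status names follow hpcs_constants; the numeric values are assumed here.
SUCCESS, INVALID_ARGS, NUMERIC_FAILURE = 0, 1, 2

def masked_mean(values, mask):
    """Mean over entries whose mask flag is True; returns (result, status)."""
    if len(values) != len(mask):
        return math.nan, INVALID_ARGS
    total, count = 0.0, 0
    for v, valid in zip(values, mask):
        if valid:
            total += v   # a NaN flagged as valid propagates into the total
            count += 1
    if count == 0:
        return math.nan, NUMERIC_FAILURE  # no valid data to average
    return total / count, SUCCESS

ok_result = masked_mean([1.0, math.nan, 3.0], [True, False, True])
empty_result = masked_mean([1.0, 2.0], [False, False])
nan_result = masked_mean([1.0, math.nan], [True, True])
```

Keeping the mask separate from the data means invalid entries never have to be encoded as sentinel values inside the array.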
2D & Axis-Aware Kernels
Purpose: Operate on matrices without materializing copies.
Modules:
hpcs_core_batched
  column-wise independent 1D processing
hpcs_core_axis
  axis-0 (column) reductions
  axis-1 (row) reductions
Design choice:
Axis loops outside, SIMD inside
No implicit transposes
Cache-aware traversal
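A rough sketch of what "no implicit transposes" and "cache-aware traversal" can mean in practice: both reductions below walk a flat buffer strictly in memory order and never build a transposed copy. The row-major layout, function names, and signatures are assumptions for this example. A Fortran kernel would see column-major storage, with the roles of the two loops swapped.

```python
def sum_axis0(buf, n_rows, n_cols):
    """Column sums of a row-major buffer, reading memory strictly in order."""
    out = [0.0] * n_cols
    for r in range(n_rows):            # axis loop outside
        base = r * n_cols
        for c in range(n_cols):        # contiguous inner loop (the SIMD candidate)
            out[c] += buf[base + c]
    return out

def sum_axis1(buf, n_rows, n_cols):
    """Row sums of a row-major buffer; each row is one contiguous run."""
    out = []
    for r in range(n_rows):
        base = r * n_cols
        acc = 0.0
        for c in range(n_cols):
            acc += buf[base + c]
        out.append(acc)
    return out

# A 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored row by row.
matrix = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
col_sums = sum_axis0(matrix, 2, 3)
row_sums = sum_axis1(matrix, 2, 3)
```

For the column sums, accumulating into a vector of partial results keeps the inner loop contiguous instead of striding down each column.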
Anomaly Detection Layer
Purpose: Structured anomaly detection built from primitives.
Modules:
hpcs_core_anomaly
  detect_anomalies (mean/std)
  detect_anomalies_robust (median/MAD)
  rolling anomaly density
  2D rolling anomaly detection
Important: This layer contains no novel math—it is a composition layer.
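The composition idea can be sketched with the standard library: each detector is only a threshold on a score built from existing primitives. The thresholds (3.0 and 3.5), the 1.4826 MAD scaling constant, and the signatures are assumptions for this illustration, not the library's documented defaults.

```python
import statistics

def detect_anomalies(values, threshold=3.0):
    """Flag points whose z-score (mean / population std) exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]

def detect_anomalies_robust(values, threshold=3.5):
    """Flag points whose robust z-score (median / MAD) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes MAD comparable to std for normal data
    if scale == 0.0:
        return [False] * len(values)
    return [abs(v - med) / scale > threshold for v in values]

data = [10.0, 10.2, 9.9, 10.1, 9.8, 50.0]
classic_flags = detect_anomalies(data)
robust_flags = detect_anomalies_robust(data)
```

On this small sample the outlier inflates the standard deviation enough to hide itself from the mean/std detector, while the median/MAD detector flags it. That masking effect is the usual motivation for offering both variants.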
Parallel & Vectorized Execution
SIMD (v0.6+)
Explicit AVX2 inner loops
4 doubles per 256-bit vector
Used in:
reductions
rolling sums
z-scores
masked ops
OpenMP
Thread-level parallelism:
large reductions
rolling windows over long arrays
Deterministic reduction variants preserved where required
Note
SIMD and OpenMP are complementary, not interchangeable: SIMD accelerates work inside a single core, while OpenMP scales work across cores. Both are used where appropriate.
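The two levels of parallelism, and one way to keep a parallel reduction deterministic, can be emulated in Python. In this sketch a 4-lane accumulator stands in for an AVX2 register and a thread pool stands in for OpenMP. The fixed chunk size and in-order combination of partial sums are assumptions about how determinism could be achieved, not a description of the library's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

LANES = 4  # an AVX2 register holds 4 doubles

def lane_sum(values):
    """Emulate a 4-lane SIMD sum: one partial accumulator per lane, then a horizontal add."""
    acc = [0.0] * LANES
    full = len(values) - len(values) % LANES
    for i in range(0, full, LANES):
        for lane in range(LANES):
            acc[lane] += values[i + lane]
    total = (acc[0] + acc[1]) + (acc[2] + acc[3])  # fixed combination order
    for v in values[full:]:                        # scalar tail
        total += v
    return total

def deterministic_parallel_sum(values, chunk=1024, workers=4):
    """Chunk boundaries depend only on `chunk`, never on the worker count."""
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lane_sum, chunks))  # map preserves chunk order
    total = 0.0
    for p in partials:                               # combine in chunk order
        total += p
    return total

data = [0.1 * i for i in range(10000)]
one_worker = deterministic_parallel_sum(data, workers=1)
eight_workers = deterministic_parallel_sum(data, workers=8)
```

Because both the partition and the combination order are fixed, the floating-point result is bit-identical regardless of how many workers run the chunks.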
C++ Fast-Path Kernels
Why C++ exists in v0.7: Some algorithms (e.g., rolling median) are asymptotically better with tree/heap structures.
Examples:
hpcs_rolling_median_fast
hpcs_rolling_mad_fast
These:
Beat SIMD at large window sizes
Are wrapped behind the same C ABI
Are invisible to Python users
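The ordered-window idea behind these kernels can be sketched in Python. The standard library has no multiset, so this example swaps in a sorted list maintained with bisect. The binary search is O(log w), but list insertion and deletion are O(w), so it shows the structure of the algorithm rather than the O(n log w) bound a std::multiset or heap pair would give. Names and the full-windows-only output convention are assumptions, and NaN inputs are not handled.

```python
from bisect import bisect_left, insort

def _middle(sorted_window):
    """Median of an already sorted window."""
    n = len(sorted_window)
    h = n // 2
    if n % 2:
        return sorted_window[h]
    return 0.5 * (sorted_window[h - 1] + sorted_window[h])

def rolling_median(values, window):
    """Rolling median that keeps the current window sorted between steps."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be in 1..len(values)")
    win = sorted(values[:window])
    out = [_middle(win)]
    for i in range(window, len(values)):
        # Remove the value leaving the window, then insert the entering one.
        del win[bisect_left(win, values[i - window])]
        insort(win, values[i])
        out.append(_middle(win))
    return out

odd_window = rolling_median([5, 1, 3, 8, 2, 9], 3)
even_window = rolling_median([1, 2, 3, 4], 2)
```

The point of the ordered structure is that each step updates the window incrementally instead of re-sorting or re-selecting over all w elements.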
Stable C ABI Layer
Design principles:
bind(C) everywhere
No return values (status via pointer)
ABI frozen for v1.x
This enables:
Python bindings
C++ wrappers
Future Rust / Julia bindings
Embedding in non-Python systems
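The status-via-pointer convention can be demonstrated without a compiled library by using a ctypes callback as a stand-in kernel. The signature below (input pointer, length, output pointer, status pointer, void return) and the name demo_reduce_sum are hypothetical and are not taken from hpcs_core.h. The numeric status values are also assumed.

```python
import ctypes

SUCCESS, INVALID_ARGS = 0, 1  # numeric values are assumptions

# Assumed C shape: void kernel(const double *x, int n, double *out, int *status)
KERNEL_SIG = ctypes.CFUNCTYPE(
    None,
    ctypes.POINTER(ctypes.c_double),
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_int),
)

@KERNEL_SIG
def demo_reduce_sum(x, n, out, status):
    """Hypothetical kernel: writes result and status through pointers, returns nothing."""
    if n <= 0:
        status[0] = INVALID_ARGS
        return
    total = 0.0
    for i in range(n):
        total += x[i]
    out[0] = total
    status[0] = SUCCESS

def call_sum(values):
    """Caller side: allocate the outputs, pass them by reference, then check status."""
    n = len(values)
    buf = (ctypes.c_double * n)(*values)
    out = ctypes.c_double(0.0)
    status = ctypes.c_int(-1)
    demo_reduce_sum(buf, n, ctypes.byref(out), ctypes.byref(status))
    return out.value, status.value

result, code = call_sum([1.0, 2.0, 3.5])
```

With no return value to interpret, every language that can pass a pointer sees the same calling shape, which is what makes the ABI easy to bind from Python, C++, Rust, or Julia.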
Python Binding Layer (v0.7)
What is exposed:
Core reductions
Rolling statistics
Robust statistics
Masked & axis operations
Fast rolling kernels
What is NOT exposed:
Internal helpers
Experimental tuning hooks
Debug utilities
Python is a consumer, not the owner, of the architecture.
What HPCSeries Core IS (and Is Not)
It IS
A high-performance numeric kernel engine
A foundation for domain-specific cores
A CPU-optimized analytics backend
A “numerical brick” for serious systems
It is NOT
A BLAS replacement
A NumPy clone
A GPU framework (yet)
A modeling / ML library
Architectural Stability
At v0.7:
The core architecture is stable
Future versions add breadth, not depth
Domain engines sit above, not inside, the core
GPU work (if any) becomes a parallel backend, not a rewrite
Design Decisions
Why Hybrid Fortran/C/C++?
Fortran: Best for array operations, excellent OpenMP support
C: SIMD intrinsics, system-level control, portable C ABI
C++: Modern features for complex rolling algorithms (heaps, multisets)
Python/Cython: User-friendly API, NumPy integration
Why Runtime SIMD Dispatch?
A single binary runs on x86 CPUs from SSE2 through AVX-512
Automatically uses best available instructions
No need for multiple builds
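Runtime dispatch usually amounts to a best-first table lookup performed once at load time. The sketch below illustrates that pattern. The feature strings, the table, and the lane-emulating kernels are hypothetical stand-ins, and real feature detection (for example via CPUID) is replaced by a plain set passed in by the caller.

```python
def make_lane_sum(lanes):
    """Build a sum kernel that emulates a SIMD width of `lanes` doubles."""
    def kernel(values):
        acc = [0.0] * lanes
        for i, v in enumerate(values):
            acc[i % lanes] += v
        return sum(acc)
    kernel.lanes = lanes
    return kernel

# Ordered best-first; the feature names are illustrative labels.
DISPATCH_TABLE = [
    ("avx512f", make_lane_sum(8)),
    ("avx2", make_lane_sum(4)),
    ("sse2", make_lane_sum(2)),
]
SCALAR_FALLBACK = make_lane_sum(1)

def select_kernel(cpu_features):
    """Return the widest kernel whose required feature is present."""
    for feature, kernel in DISPATCH_TABLE:
        if feature in cpu_features:
            return kernel
    return SCALAR_FALLBACK

# Resolved once, then every call goes straight to the chosen kernel.
reduce_sum = select_kernel({"sse2", "avx2"})
total = reduce_sum([1.0, 2.0, 3.0, 4.0, 5.0])
```

Resolving the choice once keeps the per-call cost at a single indirect call, and the scalar fallback guarantees the binary still runs where no listed feature is available.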
Why Stable C ABI?
Enables bindings to any language
Prevents ABI breakage across versions
Separates interface from implementation
Why No GPU in v1.x?
CPU-first architecture is simpler and more deterministic
Resources stay focused on CPU optimization first
GPU support, if added in a future version, would be a parallel backend, not a rewrite
Zero-Copy Design
HPCSeries never copies NumPy array data:
NumPy array in Python → pointer to data
Cython receives same pointer
C receives same pointer
Fortran receives same pointer (via C bridge)
No memory copies!
This is essential for performance with large arrays.
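The pointer hand-off can be observed with the standard library alone. In this sketch array.array stands in for a NumPy array and a ctypes view stands in for the pointer that the C and Fortran layers would receive. These stand-ins are assumptions for illustration and do not show the actual Cython bridge.

```python
import array
import ctypes

# Owner of the memory (stand-in for a NumPy float64 array).
data = array.array("d", [1.0, 2.0, 3.0])
address, length = data.buffer_info()

# View the same bytes as a C double array; nothing is copied.
c_view = (ctypes.c_double * length).from_buffer(data)

def scale_in_place(ptr, n, factor):
    """Stand-in for a native kernel that works directly on the caller's buffer."""
    for i in range(n):
        ptr[i] *= factor

scale_in_place(c_view, length, 10.0)
same_memory = ctypes.addressof(c_view) == address
```

After the call, `data` itself holds the scaled values and `same_memory` is true, which shows that every layer saw one buffer rather than a chain of copies.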
Thread Safety
HPCSeries is thread-safe for read operations:
Multiple threads can call functions simultaneously
No global mutable state
OpenMP handles internal parallelization
GIL released during C/Fortran execution
Not thread-safe:
Calibration (should be run once at startup)
Configuration saving/loading (serialize these operations)
See Also
Performance Guide - Performance optimization guide
Migration Guide - Migrating from NumPy/Pandas
API Reference