Core Concepts

Early Development Notice

All MVP.Express projects are currently in active development (pre-1.0.0) and should not be used in production environments. APIs may change without notice, and breaking changes are expected until each project reaches version 1.0.0. We welcome early adopters and contributors, but use these projects at your own risk.

This article introduces the three foundational technologies that power the MYRA Stack. Understanding these concepts will help you get the most out of our libraries and make informed architectural decisions.

Table of Contents

  1. Foreign Function & Memory API (FFM)
  2. Zero-Copy I/O
  3. Linux io_uring
  4. How MVP.Express Combines These

Foreign Function & Memory API (FFM)

What is FFM?

The Foreign Function & Memory (FFM) API is a modern Java feature (finalized in Java 22) that enables:

  1. Direct memory access outside the JVM heap (off-heap memory)
  2. Native function calls without JNI boilerplate
  3. Structured memory layouts with type-safe access

Previously, Java developers had to choose between:

  • JNI (complex, error-prone, requires native compilation)
  • Unsafe (internal API, no guarantees, may break)
  • ByteBuffer (limited, lacks structured access)

FFM provides a clean, safe, and performant alternative.
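To make the contrast concrete, here is a minimal sketch (the class name FfmVsByteBuffer is ours, purely illustrative) comparing a direct ByteBuffer with the equivalent MemorySegment code. Note that the segment is released deterministically when its arena closes, while a direct ByteBuffer's native memory is freed only when the GC eventually collects it:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;

public class FfmVsByteBuffer {

    // Old approach: raw offsets on a direct buffer, freed whenever GC decides.
    static long viaByteBuffer() {
        ByteBuffer buf = ByteBuffer.allocateDirect(16);
        buf.putLong(0, 42L);
        return buf.getLong(0);
    }

    // FFM approach: typed layout access, deterministic release at arena close.
    static long viaMemorySegment() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(16);
            seg.set(ValueLayout.JAVA_LONG, 0, 42L);
            return seg.get(ValueLayout.JAVA_LONG, 0);
        } // memory freed here, not at some future GC cycle
    }

    public static void main(String[] args) {
        System.out.println(viaByteBuffer() + " " + viaMemorySegment());
    }
}
```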

Why Off-Heap Memory Matters

Java’s garbage collector is excellent for general workloads, but for high-performance scenarios:

| Challenge | Impact | FFM Solution |
| --- | --- | --- |
| GC pauses | Unpredictable latency spikes | Off-heap data avoids GC entirely |
| Object headers | 12-16 bytes overhead per object | Direct memory has zero overhead |
| Memory fragmentation | Inefficient memory usage | Contiguous allocation |
| Cache locality | Poor performance for large datasets | Controlled memory layout |

FFM Key Concepts

1. Memory Segments

A MemorySegment represents a contiguous region of memory:

// Allocate 1KB of off-heap memory
try (Arena arena = Arena.ofConfined()) {
    MemorySegment segment = arena.allocate(1024);
    
    // Write directly to memory
    segment.set(ValueLayout.JAVA_LONG, 0, 42L);
    segment.set(ValueLayout.JAVA_DOUBLE, 8, 3.14159);
    
    // Read back
    long value = segment.get(ValueLayout.JAVA_LONG, 0);
    // Memory automatically freed when arena closes
}

2. Arenas (Lifecycle Management)

Arenas control when memory is released:

| Arena Type | Lifecycle | Thread Safety | Use Case |
| --- | --- | --- | --- |
| Arena.ofConfined() | Explicit close | Single thread | Request-scoped data |
| Arena.ofShared() | Explicit close | Multi-thread | Shared buffers |
| Arena.ofAuto() | GC-managed | Multi-thread | Long-lived pools |
| Arena.global() | Never freed | Multi-thread | Static data |
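The confinement rules are easy to demonstrate. In this sketch (ArenaScopes is our own illustrative class), a segment from a shared arena is read from another thread, while the same cross-thread access on a confined arena's segment throws WrongThreadException:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ArenaScopes {

    // A shared arena's segments can be accessed from any thread.
    static long readFromAnotherThread() throws InterruptedException {
        try (Arena shared = Arena.ofShared()) {
            MemorySegment seg = shared.allocate(ValueLayout.JAVA_LONG);
            seg.set(ValueLayout.JAVA_LONG, 0, 7L);

            long[] result = new long[1];
            Thread reader = new Thread(() -> result[0] = seg.get(ValueLayout.JAVA_LONG, 0));
            reader.start();
            reader.join();
            return result[0];
        }
    }

    // A confined arena's segments belong to the creating thread;
    // any other thread touching them gets WrongThreadException.
    static boolean confinedRejectsOtherThread() throws InterruptedException {
        try (Arena confined = Arena.ofConfined()) {
            MemorySegment seg = confined.allocate(ValueLayout.JAVA_LONG);
            boolean[] threw = new boolean[1];
            Thread t = new Thread(() -> {
                try {
                    seg.get(ValueLayout.JAVA_LONG, 0);
                } catch (WrongThreadException e) {
                    threw[0] = true;
                }
            });
            t.start();
            t.join();
            return threw[0];
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(readFromAnotherThread());      // 7
        System.out.println(confinedRejectsOtherThread()); // true
    }
}
```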

3. Memory Layouts

Define structured data with compile-time safety:

// Define a C-like struct
MemoryLayout orderLayout = MemoryLayout.structLayout(
    ValueLayout.JAVA_LONG.withName("orderId"),
    ValueLayout.JAVA_LONG.withName("timestamp"),
    ValueLayout.JAVA_INT.withName("quantity"),
    ValueLayout.JAVA_INT.withName("price"),
    ValueLayout.JAVA_BYTE.withName("side")
);

// Create type-safe accessors
VarHandle orderIdHandle = orderLayout.varHandle(
    PathElement.groupElement("orderId")
);
VarHandle quantityHandle = orderLayout.varHandle(
    PathElement.groupElement("quantity")
);

// Access fields by name (given a segment allocated with orderLayout's size)
orderIdHandle.set(segment, 0L, 12345L);
int qty = (int) quantityHandle.get(segment, 0L);

4. Native Function Calls (Downcalls)

Call C library functions directly:

// Get a handle to the native 'strlen' function
Linker linker = Linker.nativeLinker();
MethodHandle strlen = linker.downcallHandle(
    linker.defaultLookup().find("strlen").orElseThrow(),
    FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS)
);

// Call it with a native string
try (Arena arena = Arena.ofConfined()) {
    MemorySegment str = arena.allocateFrom("Hello, FFM!");
    long len = (long) strlen.invokeExact(str);  // Returns 11
}

MVP.Express FFM Usage

Roray FFM Utils builds on these primitives to provide:

  • MemorySegmentPool - GC-free buffer pooling with metrics
  • Utf8View - Zero-allocation string comparisons
  • DowncallFactory - Simplified native function binding
  • BinaryReader/BinaryWriter - Efficient structured I/O

Zero-Copy I/O

What is Zero-Copy?

Zero-copy means transferring data without copying it between buffers. In traditional I/O:

Application → JVM Heap → Direct Buffer → Kernel Buffer → Network
            (copy 1)      (copy 2)         (copy 3)

With zero-copy:

Application Buffer → Kernel → Network
                  (no copies)

Why Copies Are Expensive

Each memory copy has costs:

| Cost Type | Impact |
| --- | --- |
| CPU cycles | ~1 cycle per byte copied |
| Memory bandwidth | Saturates memory bus |
| Cache pollution | Evicts useful data from L1/L2/L3 |
| Latency | Adds microseconds per copy |

For a 64KB message with 3 copies:

  • 64KB × 3 copies = 192KB of memory traffic
  • At 50GB/s memory bandwidth = ~4μs overhead
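That estimate is plain arithmetic, sketched below (CopyCost is an illustrative class name, not part of any MVP.Express library):

```java
public class CopyCost {
    // Memory-bus time consumed by copying a message N times at a given bandwidth.
    static double copyOverheadMicros(long messageBytes, int copies, double bytesPerSec) {
        return messageBytes * (double) copies / bytesPerSec * 1e6;
    }

    public static void main(String[] args) {
        long message = 64 * 1024; // 64KB message
        double micros = copyOverheadMicros(message, 3, 50e9);
        System.out.printf("%d bytes of traffic, %.2f us%n", message * 3, micros);
        // ~3.9 us of bus time: the "~4μs overhead" quoted above
    }
}
```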

Zero-Copy Techniques in MVP.Express

1. Flyweight Pattern (MyraCodec)

Instead of deserializing into objects, access data in-place:

// Traditional approach (allocates objects)
Order order = deserialize(buffer);  // Creates Order, String, etc.
long id = order.getId();
String symbol = order.getSymbol();

// Flyweight approach (zero allocation)
OrderFlyweight flyweight = new OrderFlyweight();
flyweight.wrap(segment, offset);
long id = flyweight.getOrderId();        // Direct memory read
Utf8View symbol = flyweight.getSymbol(); // Returns view, not String

The flyweight wraps the binary data and provides accessors that read directly from the underlying memory.
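A minimal hand-written version of the pattern looks like this. The real OrderFlyweight is generated by MyraCodec, so treat OrderView, its field offsets (which follow the orderId/timestamp/quantity struct defined earlier), and its API as illustrative only:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hand-written sketch of a flyweight: wrap() stores a reference to the
// underlying memory; accessors read fields in place, allocating nothing.
public class OrderView {
    private MemorySegment segment;
    private long offset;

    public void wrap(MemorySegment segment, long offset) {
        this.segment = segment; // no copy: just remember where the data lives
        this.offset = offset;
    }

    // Offsets follow the earlier struct: orderId at 0, timestamp at 8, quantity at 16.
    public long orderId()  { return segment.get(ValueLayout.JAVA_LONG, offset); }
    public int  quantity() { return segment.get(ValueLayout.JAVA_INT,  offset + 16); }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment msg = arena.allocate(32);
            msg.set(ValueLayout.JAVA_LONG, 0, 12345L); // orderId
            msg.set(ValueLayout.JAVA_INT, 16, 7);      // quantity

            OrderView view = new OrderView();
            view.wrap(msg, 0); // reusable: one view can wrap message after message
            System.out.println(view.orderId() + " x" + view.quantity());
        }
    }
}
```

The same OrderView instance can be rewrapped over each incoming message, so the per-message allocation count stays at zero.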

2. Buffer Pools (Roray FFM Utils)

Pre-allocate buffers and reuse them:

// Create a pool of 256 buffers, 4KB each
MemorySegmentPool pool = new MemorySegmentPool(4096, 256, 512);

// Acquire a buffer (no allocation after warmup)
MemorySegment buffer = pool.acquire();
try {
    // Use buffer for I/O
    processData(buffer);
} finally {
    // Return to pool (no deallocation)
    pool.release(buffer);
}

3. Registered Buffers (MyraTransport)

Pre-register buffers with the kernel to eliminate address validation:

TransportConfig config = TransportConfig.builder()
    .registeredBuffers(TransportConfig.RegisteredBuffersConfig.builder()
        .numBuffers(256)
        .bufferSize(4096)
        .build())
    .build();

IoUringBackend backend = new IoUringBackend();
backend.initialize(config);

RegisteredBufferPool pool = TransportFactory.createBufferPool(config);
backend.registerBufferPool(pool);

RegisteredBuffer buffer = pool.acquire();
backend.receive(buffer, token);

4. View-Based String Handling

Compare strings without allocation:

Utf8View symbolView = flyweight.getSymbol();

// Zero-allocation comparison
if (symbolView.equalsString("AAPL")) {
    // Match found - no String objects created
}

// Only allocate when you really need a String
String symbol = symbolView.toString();  // Allocation happens here
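How can a comparison avoid allocation? By checking bytes in place. The sketch below (ByteView.equalsAscii, our own helper, not the actual Utf8View API) shows the core trick for ASCII data:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative view-style comparison: bytes are compared where they sit
// in memory, so no String or byte[] is ever created.
public class ByteView {
    static boolean equalsAscii(MemorySegment seg, long offset, int len, String expected) {
        if (len != expected.length()) return false;
        for (int i = 0; i < len; i++) {
            if (seg.get(ValueLayout.JAVA_BYTE, offset + i) != (byte) expected.charAt(i)) {
                return false;
            }
        }
        return true; // matched with zero allocation
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocateFrom("AAPL"); // NUL-terminated UTF-8
            System.out.println(equalsAscii(seg, 0, 4, "AAPL"));
        }
    }
}
```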

Zero-Copy Best Practices

  1. Pool everything - Buffers, flyweights, views
  2. Avoid toString() - Use views for comparisons
  3. Size buffers appropriately - Match typical message sizes
  4. Align data - Memory alignment improves access speed
  5. Batch operations - Amortize any unavoidable copies

Linux io_uring

What is io_uring?

io_uring is a Linux kernel interface (5.1+) for asynchronous I/O that provides:

  • Batched system calls - Submit multiple operations in one syscall
  • True async I/O - Operations complete independently
  • Zero-copy receives - Kernel writes directly to user buffers
  • SQPOLL mode - Kernel thread polls for submissions (no syscalls)

Traditional I/O vs io_uring

Traditional (epoll + non-blocking):

For each operation:
  1. epoll_wait()     → syscall
  2. read()/write()   → syscall
  3. Handle EAGAIN    → retry

io_uring:

Queue N operations:
  1. io_uring_submit() → single syscall
  2. io_uring_wait()   → single syscall (gets M completions)

With SQPOLL mode, even the submit syscall is eliminated.

io_uring Architecture

User Space                                  Kernel
+---------------------------+              +-----------------------------+
| Submission Queue (SQ)     |   submit     | io_uring Instance           |
| [SQE][SQE][SQE][SQE]      | -----------> |                             |
|                           |              | Registered Buffers (opt.)   |
| Completion Queue (CQ)     |   complete   | [buf0][buf1][buf2][buf3]... |
| [CQE][CQE][CQE][CQE]      | <----------- |                             |
+---------------------------+              | SQPOLL Thread (optional):   |
                                           | polls SQ without syscalls   |
                                           +-----------------------------+

Key io_uring Features Used by MVP.Express

1. Registered Buffers

Pre-validated memory regions eliminate per-operation address checks:

IoUringBackend backend = ...;

// Register buffers with kernel at startup
backend.registerBufferPool(pool);

// All subsequent I/O uses registered buffers
// Kernel skips address validation → 1.7x throughput improvement

2. Batch Submission

Submit multiple operations with one syscall:

IoUringBackend backend = ...;
RegisteredBuffer buffer1 = ...;
RegisteredBuffer buffer2 = ...;
RegisteredBuffer buffer3 = ...;

// Queue multiple operations
backend.receive(buffer1, token1);
backend.receive(buffer2, token2);
backend.send(buffer3, token3);

// Single syscall submits all
int submitted = backend.submitBatch(); // Returns 3

3. Multi-Shot Receive

Keep a receive operation active across multiple completions:

IoUringBackend backend = ...;
RegisteredBuffer buffer = ...;

// Traditional: resubmit after each receive
while (running) {
    backend.receive(buffer, token);      // Queue
    backend.submitBatch();               // Syscall
    int n = backend.waitForCompletion(1000, (tok, res) -> {}); // Handle
}

// Multi-shot: submit once, receive many
backend.receiveMultishot(buffer, token); // Queue once
backend.submitBatch();                   // One syscall
while (running) {
    int n = backend.waitForCompletion(1000, (tok, res) -> {});
}

4. SQPOLL Mode

Dedicated kernel thread polls for submissions:

TransportConfig config = TransportConfig.builder()
    .sqPollEnabled(true)           // Enable SQPOLL
    .sqPollCpuAffinity(3)          // Pin to CPU 3
    .sqPollIdleTimeout(500)        // 500 microseconds idle
    .build();

// With SQPOLL:
// - No syscall for io_uring_submit()
// - Kernel thread continuously polls SQ
// - Sub-microsecond submission latency

5. Zero-Copy Send

Avoid user→kernel copy for large payloads:

IoUringBackend backend = ...;
RegisteredBuffer buffer = ...;

// Regular send: copies data to kernel
backend.send(buffer, token);

// Zero-copy send: kernel reads directly from user buffer
backend.sendZeroCopy(buffer, token);
// Note: Two completions - send complete + notification
// Buffer must not be modified until notification received

io_uring Performance Benefits

| Metric | Traditional (epoll) | io_uring | Improvement |
| --- | --- | --- | --- |
| Syscalls per op | 2-3 | 0.1-0.5 | 4-30x fewer |
| p50 latency | 25-50μs | 5-15μs | 3-5x lower |
| p99 latency | 100-500μs | 20-50μs | 5-10x lower |
| Throughput | 200-500K msg/s | 1-2M msg/s | 3-5x higher |

How MVP.Express Combines These

The MVP.Express stack integrates FFM, zero-copy, and io_uring at every layer:

Your Application
  • Works with typed flyweights and views
  • No manual memory management
  • Zero-GC hot path
MyraCodec
  • Schema-driven code generation
  • Flyweight accessors (FFM MemorySegment)
  • Zero-copy encode/decode
MyraTransport
  • io_uring backend with registered buffers
  • Batch submission and multi-shot receive
  • SQPOLL for minimum latency
Roray FFM Utils
  • MemorySegmentPool for buffer management
  • Utf8View for zero-alloc string handling
  • DowncallFactory for native bindings

Data Flow Example

A message arriving from the network:

  1. MyraTransport receives data into a registered buffer (zero-copy from kernel)
  2. Roray FFM Utils provides the MemorySegment wrapping the buffer
  3. MyraCodec flyweight wraps the segment for structured access
  4. Application reads fields via flyweight (direct memory reads, no allocation)
  5. Application uses Utf8View.equalsString() for routing (no String allocation)
  6. Response written via flyweight directly to send buffer
  7. MyraTransport sends via io_uring (zero-copy to kernel with SEND_ZC)

Result: End-to-end processing with zero heap allocations, minimal syscalls, and no data copies.


Next Steps

Now that you understand the core concepts: