Optane, a modern technology tragedy (plus FreeBSD nvmecontrol)

Sunday, January 5, 2025 

Intel won the storage wars.  They invented a storage technology in 2015 that was the best of everything: almost as fast as RAM (of the time), basically infinite write endurance in any normal use, and fairly cheap.  They even made a brilliant config on M.2 with an integrated supercap for power-failure write flush. Just awesome and absolutely the right tech for modern file systems like ZFS. It is perfect for SLOGs.  You wish you had a laptop that booted off an Optane M.2.  You wish your desktop drives were all NVMe Optane.

Well, wishes are all we got left, sadly.  Optane, RIP 2022.

You can still buy Optane parts on the secondary market, and it seems some of the enterprise DC products are at least still marked current on Intel’s website, but all retail stock seems to be gone.

camelcamelcamel.com price history of intel optane P1600X 118GB

Man was that an amazing deal at $0.50/GB.  In my application, the only practical form factor was M.2 and even that was a bit wonky in an HP DL360 G9, but more on that later.  There are a variety of options and most are available on the used market:

PN               Intro  Cap (GB)  Write MB/s  Write k IOPS  PBW endurance  PLP  $ (market, 2024)
MEMPEK1W016GAXT  Q1’17        16         145            35          0.2    NO   5
SSDPEL1K100GA    Q1’19       100       1,000           250         10.9    YES  109
SSDPEL1K200GA01  Q1’19       200       2,000           400         21.9    YES  275
SSDPEL1K375GA    Q1’19       375       2,200           550         41      YES  800/1,333/NA
SSDPEK1A058GA    Q2’22        58         890           224          0.635  YES  32/140
SSDPEK1A118GA01  Q2’22       118       1,050           243          1.292  YES  70/229

Any of these would be a good choice for a SLOG on rotating media, but the later ones are just insane in terms of performance, and that’s compared to enterprise SSDs.  Their pricing cratered after they were canceled and, dangit, I didn’t get ’em. The used market has since gone way up, a better price increase than bitcoin over the same period, and they’re not virtual beanie babies! The SSDPEL1K100GA is the best deal at the moment and has a beefy supercap for power continuity; it is still $818 on Amazon and was apparently introduced at $1,170.  That pricing might explain why Optane didn’t do better. The 375 GB M.2 would be an awfully nice find at $0.50/GB; that’d be a pretty solid laptop boot disk.

Hardware

For a SLOG you really want two devices mirrored in case one fails.  That said, the risk of an Optane DC-grade device failing is trivial.  Power loss is the most likely cause of failure (and the reason your main array would fail to write out the transactions committed to the SLOG), and these devices have Power Loss Protection.  That leaves media failure, and since this is 3D XPoint it is NOT going to wear out like NAND, so it’s rational to single-disk it.  I almost striped mine but in the end decided against it, because a two-device stripe roughly doubles the failure rate of a single device, is far worse than a mirror, and I don’t really need the space.
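For a rough feel of how the three layouts compare, here is a back-of-envelope sketch. It assumes the two devices fail independently, and p is a made-up per-device failure probability over some period, not a measured Optane figure. The function name log_loss_odds is just for illustration.

```sh
# Sketch: odds of losing the log vdev under three layouts.
# Assumes independent failures; p is a hypothetical per-device probability.
log_loss_odds() {
    awk -v p="$1" 'BEGIN {
        single = p                        # one device, one way to fail
        stripe = 1 - (1 - p) * (1 - p)    # either of two devices failing
        mirror = p * p                    # both devices failing
        printf "single=%.6f stripe=%.6f mirror=%.6f\n", single, stripe, mirror
    }'
}

log_loss_odds 0.01
# prints: single=0.010000 stripe=0.019900 mirror=0.000100
```

For small p the stripe is about twice as likely to fail as a single device, while the mirror is better than either by a factor of roughly 1/p.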

So how do you install two M.2 devices in a computer that doesn’t have M.2 slots on the mobo?  With a PCI card, of course.  But wait, you want two in a slot, right?  And these are x4 devices, the slots are x8 or x16, so two should be able to pair, right?

Not so fast.  Welcome to the bizarre world of PCIe bifurcation. If you want to add two drives to the core PCIe bus, you have to split the bus to address the cards.  Some mobos support this and others do not.  As shipped, the HPE DL360 G9 did not.

BUT, a firmware update, v 1.60 (April 2016) added “support to configure the system to bifurcate PCIe Slot 1 on the DL360 Gen9 or PCIe Slot 2 on the DL380 Gen9.” W00t. A simple Supermicro AOC-SLG3-2M2 supports 2x M.2 cards and only requires bifurcation to work, all good.

PCIE bifurcation DL360 service menu dual x8

Not so fast. In order to pack the DL360 G9 with 2.5" SSDs, you need a Smart Array Controller (set for passthru for ZFS), and that sits in slot 1. While I believe it can go in any x16 slot, the cabling is not compatible and that’s a lotta SAS cables to replace. Bifurcation on the mobo is out.

Dual SSDPEL1K100GA in Supermicro AOC-SLG3-2M2 PCI Adapter

But you can bifurcate on a PCI card just as well. Likely this adds some latency, and it’d be interesting to perf test against more direct connections. I ended up choosing a RIITOP dual M.2 22110 PCI card and it worked out of the box transparently: both disks showed up, and while I’m not getting 250,000 IOPS, performance is good.  It is based on the ASMedia ASM2812, which seems like a reasonable chip used in a lot of the lower cost devices of this type, most with 4x M.2 slots instead of 2.

Software

FreeBSD recognizes the devices and addresses them with nvmecontrol.  You can pull a full status report with, for example, nvmecontrol identify nvme0, which provides information on the device, or nvmecontrol identify nvme0ns1, which gives details about the storage configuration, including something important (foreshadowing): the LBA format (probably #00, 512).

Current LBA Format:          LBA Format #00
...
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Good
LBA Format #01: Data Size:   512  Metadata Size:     8  Performance: Good
LBA Format #02: Data Size:   512  Metadata Size:    16  Performance: Good
LBA Format #03: Data Size:  4096  Metadata Size:     0  Performance: Best
LBA Format #04: Data Size:  4096  Metadata Size:     8  Performance: Best
LBA Format #05: Data Size:  4096  Metadata Size:    64  Performance: Best
LBA Format #06: Data Size:  4096  Metadata Size:   128  Performance: Best
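If you have several devices to check, a small filter can pull the active format and its sector size out of that report. This is only a sketch: it assumes the identify output is laid out exactly like the excerpt above, and current_lba is a made-up helper name, not part of nvmecontrol.

```sh
# Sketch: read `nvmecontrol identify <namespace>` output on stdin and print
# the current LBA format id followed by its data size in bytes.
current_lba() {
    awk '
        /^Current LBA Format:/ { cur = $NF }     # last field, e.g. "#00"
        /^LBA Format #/ {
            id = $3
            sub(/:$/, "", id)                    # "#00:" becomes "#00"
            size[id] = $6                        # the number after "Data Size:"
        }
        END { if (cur != "") print cur, size[cur] }
    '
}

# Demo on two lines copied from the report above.
current_lba <<'EOF'
Current LBA Format:          LBA Format #00
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Good
EOF
# prints: #00 512
```

On a live system the same function would be used as nvmecontrol identify nvme0ns1 | current_lba.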

The first thing I’d do with a used device is wipe it:

gpart destroy -F nda0
gpart destroy -F nda1

I would not bother formatting the device to LBA format #03 (4k).  Everyone tells you that you should, but you don’t get much of a performance increase and it is a huge pain, because nvmecontrol currently times out after 60 seconds (at least until the needed patch is pushed to the kernel or you recompile your kernel with some fixes).  If you did want to try, you’d run:

# time nvmecontrol format -f 3 -m 0 -p 0 -l 0 nvme0
316.68 real         0.00 user         0.00 sys
(no errors)

-f 3 sets LBA Format #03, 4096 which should give “Performance: Best” which certainly sounds better than “Good.”

But on a stock kernel it’ll error out.  You need to modify /usr/src/sys/dev/nvme/nvme_private.h as shown below and recompile the kernel so it won’t time out after 60 seconds.

#define NVME_ADMIN_TIMEOUT_PERIOD       (600)    /* in seconds def 60 */
#define NVME_DEFAULT_TIMEOUT_PERIOD     (600)    /* in seconds def 30 */
#define NVME_MIN_TIMEOUT_PERIOD         (5)
#define NVME_MAX_TIMEOUT_PERIOD         (600)    /* in seconds def 120 */

Performance Aside

I tested 512 vs. 4k in my system.  Perhaps the AIC’s bridge latency or the whole system’s performance limited the Optane cards so much that no difference could appear, but these cards do rock at the hardware level (this is with 4k formatting):

# nvmecontrol perftest -n 32 -o read -s 4096 -t 30 nvme0ns1 &&  nvmecontrol perftest -n 32 -o write -s 4096 -t 30 nvme0ns1
Threads: 32 Size:   4096  READ Time:  30 IO/s:  598310 MB/s: 2337
Threads: 32 Size:   4096 WRITE Time:  30 IO/s:  254541 MB/s:  994

That’s pretty darn close to what’s on the label.
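As a sanity check on how the two perftest columns relate, IO/s times the transfer size should give the throughput. A quick sketch (iops_to_mibs is a made-up helper name) reproduces the reported figures when the result is taken in units of 1,048,576 bytes, which suggests the MB/s column is really MiB/s:

```sh
# Sketch: convert IO/s at a given block size into whole MiB/s.
iops_to_mibs() {
    # $1 = IO/s, $2 = block size in bytes
    awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%d\n", iops * bs / 1048576 }'
}

iops_to_mibs 598310 4096   # read:  prints 2337
iops_to_mibs 254541 4096   # write: prints 994
```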

However, testing 512 vs. 4k formatting at the OS level (didn’t test raw) it was a less extraordinary story:

LBA / FW ver.   4k E2010650   512 E2010650   4k E2010485   512 E2010600
Median  Mb/s         759.20         762.30        757.50         742.80
Average Mb/s         721.70         722.87        721.64         724.35

Definitely not +10%

SLOG performance test on Optane SSDPEL1K100GA

So I wouldn’t bother reformatting them myself.  Testing a few configurations with

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

I get

Device \ Metrics          Max IOPS   Avg WBW MiB/s   Avg SLAT µs   Avg LAT µs
10 SAS SSD ZFS Z2 Array     20,442           1,135         4,392        53.94
Optane 100G M.2 Mirror      20,774             624         3,821        95.77
tmpfs RAM disk              23,202           1,465          6.67        42

Optane is performing pretty close to the system limit by most metrics – the SLAT and LAT metrics are highly dependent on software.
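To put a number on "pretty close", here is a small sketch that expresses each Max IOPS figure from the fio runs as a percentage of the tmpfs result, treating tmpfs as the ceiling for this box. That treatment is an assumption, and pct_of is a made-up helper name.

```sh
# Sketch: express one measurement as a percentage of a reference value.
pct_of() {
    # $1 = value, $2 = reference
    awk -v v="$1" -v ref="$2" 'BEGIN { printf "%.1f%%\n", 100 * v / ref }'
}

pct_of 20774 23202   # Optane mirror vs tmpfs: prints 89.5%
pct_of 20442 23202   # SAS SSD array vs tmpfs: prints 88.1%
```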

Formatting

I did something a bit funky since 100GB is way more than this little server could ever use for SLOG.  I set the SLOG at 16GB, which is probably 4x overkill, then used the rest as /var mountpoints for my jails, because the Optanes have basically infinite write endurance and the log files in /var get the most writes on the system.  I’m not going into much detail on this because it’s my own weird thing and the chance anyone else cares is pretty small.

Initialize GPT

gpart create -s gpt nda0
gpart create -s gpt nda1

Create Partitions

gpart add -b 2048 -s 16g -t freebsd-zfs -a 4k -l slog0 nda0
gpart add -b 2048 -s 16g -t freebsd-zfs -a 4k -l slog1 nda1
gpart add -s 74g -t freebsd-zfs -a 4k -l ovar0 nda0
gpart add -s 74g -t freebsd-zfs -a 4k -l ovar1 nda1
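A quick sketch of why 16g plus 74g fits. It assumes the drive's nominal 100 GB is decimal, that gpart's g suffix means GiB, and it ignores the small GPT and alignment overhead; the exact sector count of a real drive may differ a little. partition_budget is a made-up helper name.

```sh
# Sketch: compare planned partition sizes (GiB) against a device's
# nominal decimal capacity (GB).
partition_budget() {
    # $1 = device size in decimal GB, remaining args = partition sizes in GiB
    dev_gb=$1
    shift
    awk -v dev="$dev_gb" -v parts="$*" 'BEGIN {
        n = split(parts, p, " ")
        used = 0
        for (i = 1; i <= n; i++) used += p[i]
        total = dev * 1e9 / (1024 * 1024 * 1024)   # decimal GB to GiB
        printf "device=%.2fGiB used=%dGiB free=%.2fGiB\n", total, used, total - used
    }'
}

partition_budget 100 16 74
# prints: device=93.13GiB used=90GiB free=3.13GiB
```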

ZPool Operations

zpool add zroot log mirror nda0p1 nda1p1
zpool create optavar mirror nda0p2 nda1p2
zpool set autotrim=on optavar

Create Datasets

zfs create -o mountpoint=/usr/local/jails/containers/jail/var -o compression=on -o exec=off -o atime=off -o setuid=off optavar/jail-var
etc
Posted at 18:32:50 GMT-0700

Category: FreeBSD, HowTo, Positive, Reviews, Technology