Intel won the storage wars. They invented a storage technology in 2015 that was the best of everything: almost as fast as (then) RAM, basically infinite write endurance in any normal use, and fairly cheap. They even made a brilliant config on M.2 with an integrated supercap for power-failure write flush. Just awesome, and absolutely the write tech for modern file systems like ZFS. It is perfect for SLOGs. You wish you had a laptop that booted off an Optane M.2. You wish your desktop drives were all NVMe Optane.
Well, wishes are all we got left, sadly. Optane, RIP 2022.
You can still buy Optane parts on the secondary market, and it seems some of the enterprise DC products are at least still marked current on Intel’s website, but all retail stock seems to be gone.
Man was that an amazing deal at $0.50/GB. In my application, the only practical form factor was M.2 and even that was a bit wonky in an HP DL360 G9, but more on that later. There are a variety of options and most are available on the used market:
| PN | Intro | Cap (GB) | Write MB/s | Write k IOPS | PBW Endurance | PLP | $ (market, 2024) |
|---|---|---|---|---|---|---|---|
| MEMPEK1W016GAXT | Q1’17 | 16 | 145 | 35 | 0.2 | NO | 5 |
| SSDPEL1K100GA | Q1’19 | 100 | 1,000 | 250 | 10.9 | YES | 109 |
| SSDPEL1K200GA01 | Q1’19 | 200 | 2,000 | 400 | 21.9 | YES | 275 |
| SSDPEL1K375GA | Q1’19 | 375 | 2,200 | 550 | 41 | YES | 800/1,333/NA |
| SSDPEK1A058GA | Q2’22 | 58 | 890 | 224 | 635 | YES | 32/140 |
| SSDPEK1A118GA01 | Q2’22 | 118 | 1,050 | 243 | 1,292 | YES | 70/229 |
Any of these would be a good choice for a SLOG on rotating media, but the later ones are just insane in terms of performance, and that’s compared to enterprise SSDs. Their pricing cratered after they were canceled and, dangit, I didn’t get ’em. The used market has gone way up since, a better price increase than Bitcoin over the same period, and they’re not virtual beanie babies! The SSDPEL1K100GA is the best deal at the moment and has a beefy supercap for power continuity; it is still $818 on Amazon, apparently introduced at $1,170. That pricing might explain why Optane didn’t do better. The 375 GB M.2 would be an awfully nice find at $0.50/GB; that’d be a pretty solid laptop boot disk.
Hardware
For SLOG you really want two devices mirrored in case one fails. But the risk of a DC-grade Optane device failing is trivial: it has Power Loss Protection, which covers the most likely failure mode (and the very scenario in which your main array would have failed to write out the transactions committed to the SLOG), and since it’s 3D XPoint it is NOT going to wear out like NAND, so it’s rational to single-disk it. I almost striped mine but in the end decided against it because striping doubles the failure rate over a single device (and a mirror is far more reliable than either), and I don’t really need the space.
So how do you install two M.2 devices in a computer that doesn’t have M.2 slots on the mobo? With a PCI card, of course. But wait, you want two in a slot, right? And these are x4 devices, the slots are x8 or x16, so two should be able to pair, right?
Not so fast. Welcome to the bizarre world of PCI furcation. If you want to add two drives to the core PCI bus, you have to split the bus to address the cards. Some mobos support this and others do not. As shipped, the HPE DL360 G9 did not.
BUT, a firmware update, v 1.60 (April 2016) added “support to configure the system to bifurcate PCIe Slot 1 on the DL360 Gen9 or PCIe Slot 2 on the DL380 Gen9.” W00t. A simple Supermicro AOC-SLG3-2M2 supports 2x M.2 cards and only requires bifurcation to work, all good.
Not so fast. In order to pack the DL360 G9 with 2.5" SSDs, you need a Smart Array controller (set for passthrough for ZFS), and that sits in slot 1. While I believe it could go in any x16 slot, the cabling is not compatible and that’s a lotta SAS cables to replace. Bifurcation on the mobo is out.
But you can furcate on a PCI card just as well – this likely adds some latency, and it’d be interesting to perf test it against a more direct connection. I ended up choosing a RIITOP dual M.2 22110 PCIe card and it worked out of the box, transparently: both disks showed up, and while I’m not getting 250,000 IOPS, performance is good. It is based on the ASMedia ASM2812, which seems like a reasonable chip used in a lot of the lower-cost devices of this type, most with 4x M.2 slots instead of 2.
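A quick way to confirm the card’s bridge and both Optane controllers actually showed up is to poke at the PCIe bus and the NVMe subsystem; a minimal sketch (the grep patterns are just what I’d expect to match, adjust as needed):

```sh
# List PCIe devices; the ASMedia bridge and two NVMe controllers should appear
pciconf -lv | grep -i -B3 -e nvme -e asmedia

# List the NVMe controllers and namespaces FreeBSD has attached
nvmecontrol devlist
```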
Software
FreeBSD recognizes the devices and addresses them with `nvmecontrol`. You can pull a full status report with, for example, `nvmecontrol identify nvme0`, which provides information on the device, or `nvmecontrol identify nvme0ns1`, which gives details about the storage configuration, including something important (foreshadowing): the LBA format (probably #00, 512).
```
Current LBA Format:  LBA Format #00
...
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Good
LBA Format #01: Data Size:   512  Metadata Size:     8  Performance: Good
LBA Format #02: Data Size:   512  Metadata Size:    16  Performance: Good
LBA Format #03: Data Size:  4096  Metadata Size:     0  Performance: Best
LBA Format #04: Data Size:  4096  Metadata Size:     8  Performance: Best
LBA Format #05: Data Size:  4096  Metadata Size:    64  Performance: Best
LBA Format #06: Data Size:  4096  Metadata Size:   128  Performance: Best
```
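If you just want the active format without reading the whole identify dump, a quick filter does it (assuming the controllers enumerated as nvme0 and nvme1):

```sh
# Show only the LBA format currently in use on each namespace
nvmecontrol identify nvme0ns1 | grep "Current LBA"
nvmecontrol identify nvme1ns1 | grep "Current LBA"
```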
The first thing I’d do with a used device is wipe it:
```sh
# note: gpart works on the GEOM providers (nda0/nda1), not the nvme0/nvme1 controller nodes
gpart destroy -F nda0
gpart destroy -F nda1
```
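If the drives previously lived in someone else’s pool, it may also be worth clearing stale ZFS labels and confirming the partition table is really gone; a sketch, assuming the namespaces attach as nda0/nda1:

```sh
# Clear any old ZFS labels (ZFS keeps copies at both ends of the device)
zpool labelclear -f /dev/nda0
zpool labelclear -f /dev/nda1

# Confirm gpart no longer sees a partition scheme (expect "No such geom")
gpart show nda0
```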
I would not bother formatting the device to LBA 03/4k. Everyone tells you you should, but you don’t get much of a performance increase, and it is a huge pain because `nvmecontrol` currently times out after 60 seconds (at least until the needed patch is pushed to the kernel, or you recompile your kernel with some fixes). If you did want to try, you’d run:
```
# time nvmecontrol format -f 3 -m 0 -p 0 -l 0 nvme0
      316.68 real         0.00 user         0.00 sys
(no errors)
```
`-f 3` sets LBA Format #03 (4096), which should give “Performance: Best,” which certainly sounds better than “Good.”
But with a stock kernel it’ll error out: you need to modify `/usr/src/sys/dev/nvme/nvme_private.h` with the changes below and recompile the kernel so it won’t time out after 60 seconds.
```c
#define NVME_ADMIN_TIMEOUT_PERIOD	(600)	/* in seconds, def 60 */
#define NVME_DEFAULT_TIMEOUT_PERIOD	(600)	/* in seconds, def 30 */
#define NVME_MIN_TIMEOUT_PERIOD		(5)
#define NVME_MAX_TIMEOUT_PERIOD		(600)	/* in seconds, def 120 */
```
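Rebuilding after the header tweak is the standard FreeBSD kernel build dance; roughly (assuming a GENERIC kernel config and sources already in /usr/src):

```sh
cd /usr/src
make -j"$(sysctl -n hw.ncpu)" buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now
```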
Performance Aside
I tested 512 vs 4k in my system – and perhaps the AIC’s bridge latency or the whole system’s performance so limited the Optane cards that no difference would appear either way, but these cards do rock at the hardware level (this is with 4k formatting):
```
# nvmecontrol perftest -n 32 -o read -s 4096 -t 30 nvme0ns1 && nvmecontrol perftest -n 32 -o write -s 4096 -t 30 nvme0ns1
Threads: 32 Size:   4096  READ Time:  30 IO/s:  598310 MB/s: 2337
Threads: 32 Size:   4096 WRITE Time:  30 IO/s:  254541 MB/s:  994
```
That’s pretty darn close to what’s on the label.
However, testing 512 vs. 4k formatting at the OS level (I didn’t test raw), it was a less extraordinary story:
| LBA / FW ver. | 4k E2010650 | 512 E2010650 | 4k E2010485 | 512 E2010600 |
|---|---|---|---|---|
| Median MB/s | 759.20 | 762.30 | 757.50 | 742.80 |
| Average MB/s | 721.70 | 722.87 | 721.64 | 724.35 |
Definitely not +10%. So I wouldn’t bother reformatting them myself.

Testing a few configurations with

```sh
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k \
    --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
```

I get
| Device \ Metric | Max IOPS | Avg WBW MiB/s | avg SLAT µs | avg LAT µs |
|---|---|---|---|---|
| 10 SAS SSD ZFS Z2 Array | 20,442 | 1,135 | 4,392 | 53.94 |
| Optane 100G M.2 Mirror | 20,774 | 624 | 3,821 | 95.77 |
| tmpfs RAM disk | 23,202 | 1,465 | 6.67 | 42 |
Optane is performing pretty close to the system limit by most metrics – the SLAT and LAT metrics are highly dependent on software.
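For reference, the same fio job can be aimed at a particular filesystem with the `--directory` flag; a minimal sketch with a hypothetical mountpoint (substitute wherever the pool or tmpfs under test is mounted):

```sh
# Run the same random-write job against a specific filesystem
# (/optavar/fio-test is a hypothetical test directory, not a path from this build)
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k \
    --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based \
    --end_fsync=1 --directory=/optavar/fio-test
```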
Formatting
I did something a bit funky, since 100GB is way more than this little server could ever use for SLOG. I set it at 16GB, which is probably 4x overkill, then used the rest as /var mountpoints for my jails, because the Optanes have basically infinite write endurance and the log files in /var get the most writes on the system. I’m not going into much detail on this because it’s my own weird thing and the chances anyone else cares are pretty small.
Initialize GPT
```sh
gpart create -s gpt nda0
gpart create -s gpt nda1
```
Create Partitions
```sh
gpart add -b 2048 -s 16g -t freebsd-zfs -a 4k -l slog0 nda0
gpart add -b 2048 -s 16g -t freebsd-zfs -a 4k -l slog1 nda1
gpart add -s 74g -t freebsd-zfs -a 4k -l ovar0 nda0
gpart add -s 74g -t freebsd-zfs -a 4k -l ovar1 nda1
```
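A quick check that the partitions and labels came out as intended:

```sh
# Show the partitions with their GPT labels
gpart show -l nda0 nda1
ls /dev/gpt/
```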
ZPool Operations
```sh
zpool add zroot log mirror nda0p1 nda1p1
zpool create optavar mirror nda0p2 nda1p2
zpool set autotrim=on optavar
```
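It’s worth confirming the log vdev actually attached as a mirror before trusting it:

```sh
# The "logs" section of zroot should show a mirror of the two Optane partitions
zpool status zroot

# And the new pool should report autotrim on
zpool get autotrim optavar
```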
Create Datasets
```sh
zfs create -o mountpoint=/usr/local/jails/containers/jail/var \
    -o compression=on -o exec=off -o atime=off -o setuid=off \
    optavar/jail-var
```

etc.
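With more than a couple of jails, a small loop keeps the dataset options consistent; purely illustrative, with made-up jail names:

```sh
# Hypothetical jail names; adjust mountpoints and options to taste
for j in web db mail; do
    zfs create -o mountpoint="/usr/local/jails/containers/${j}/var" \
        -o compression=on -o exec=off -o atime=off -o setuid=off \
        "optavar/${j}-var"
done
```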