<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="https://openzfsonosx.org/w/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>https://openzfsonosx.org/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=101.175.67.14</id>
		<title>OpenZFS on OS X - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="https://openzfsonosx.org/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=101.175.67.14"/>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Special:Contributions/101.175.67.14"/>
		<updated>2026-05-09T21:00:17Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.22.3</generator>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-21T22:53:18Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: /* Detecting memory handling errors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
Press &amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or use another terminal.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always force a panic with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But for inspecting a hang without panicking, this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another symbol file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation that used the zones allocator; we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 in the boot arguments; then we can use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
and indeed, list any index&lt;br /&gt;
&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be significant filesystem performance improvement.&lt;br /&gt;
&lt;br /&gt;
Selection of an allocator has been very challenging on OS X. In the last year we have evolved from:&lt;br /&gt;
* Direct calls to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4 KiB).&lt;br /&gt;
* Direct calls to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because exceeding this limit causes a kernel panic with no other feedback mechanism available.&lt;br /&gt;
* Direct calls to bmalloc - a home-grown slice allocator that obtained slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocated to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of the kmem and vmem allocators using code from Illumos, plus a memory pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0, the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours: efficient lifecycle control of objects and an effective response to memory pressure in the machine. The allocator was also not very space efficient because of metadata overheads in bmalloc, and we were not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the Illumos kmem code on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem expect a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We tuned vmem to take 512 KiB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper describes the design that kmem and vmem implement: https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum, spl-kmem.c and spl-osx.c need to be built with this define for the debugging features to be completely enabled.&lt;br /&gt;
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect not to panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c, set kmem_panic=0 to log only, or kmem_panic=1 to log and panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
    {&lt;br /&gt;
        ...&lt;br /&gt;
        int *p;&lt;br /&gt;
        for (int i = 0; i &amp;lt; 20; i++) {&lt;br /&gt;
            p = (int *)spl_kmem_alloc(1024);&lt;br /&gt;
            spl_kmem_free(p);&lt;br /&gt;
            *p = 0;   /* deliberate modify-after-free */&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see the kind of memory corruption, the actual corrupted data, which kmem cache was involved, the relative time that the last action occurred, and the stack trace for the last action (which was a call to zfs_kmem_free()), indicating that spl_start() was implicated in the fault. This event would have been logged the next time the modified-after-free buffer was allocated.&lt;br /&gt;
&lt;br /&gt;
=== Compiling for older OS X versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version - in this case, compiling for 10.9 on a 10.10 machine:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html Brendan Gregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Unit Test ==&lt;br /&gt;
&lt;br /&gt;
We have created an initial port of the standard ZFS test suite. It consists of a collection of scripts and miscellaneous utility programs that exercise the complete breadth and depth of the ZFS filesystem.&lt;br /&gt;
&lt;br /&gt;
The tests are best run in a virtual machine, with a baseline configuration captured in a snapshot. Run the tests on the VM and then, due to their destructive nature, revert the VM to the snapshot in preparation for future test runs. The tests take 2-4 hours to run depending on the hardware setup.&lt;br /&gt;
&lt;br /&gt;
=== Setup ===&lt;br /&gt;
&lt;br /&gt;
The user zfs-tests needs to be able to run sudo without entering a password. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   zfs-tests ALL=(ALL) NOPASSWD: ALL&lt;br /&gt;
&lt;br /&gt;
The sudo root environment must be configured to pass certain environment variables from zfs-tests through to the root environment. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   Defaults env_keep += &amp;quot;__ZFS_MAIN_MOUNTPOINT_DIR&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Modify /etc/bashrc to contain&lt;br /&gt;
&lt;br /&gt;
   export __ZFS_MAIN_MOUNTPOINT_DIR=&amp;quot;/&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If your development directory is ~you/Developer, clone zfs, spl and zfs-test into that directory:&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs-test.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/spl.git&lt;br /&gt;
&lt;br /&gt;
Build ZFS using the building-from-source instructions.&lt;br /&gt;
&lt;br /&gt;
Ensure that /var/tmp has approximately 100GB of free space.&lt;br /&gt;
&lt;br /&gt;
Create three virtual hard drives of 10-20GB capacity each.&lt;br /&gt;
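&lt;br /&gt;
A minimal sketch for creating the backing files, assuming sparse files are acceptable for your VM setup (the names and sizes here are illustrative, not prescribed by the test suite):&lt;br /&gt;

```shell
# Create three sparse 10 GiB backing files (names and sizes are illustrative).
# bs is given in bytes so the same command works with both BSD and GNU dd;
# count=0 with seek extends the file without writing any data.
for i in 1 2 3; do
    dd if=/dev/zero of="vdisk$i" bs=1048576 count=0 seek=10240
done
```

The files can then be attached as raw disk images with hdiutil, as shown in the file-based zpools section below, and the resulting /dev/diskN nodes listed in the DISKS variable.&lt;br /&gt;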
&lt;br /&gt;
=== Run Test Suite ===&lt;br /&gt;
&lt;br /&gt;
Set up the tests to run:&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer/zfs-tests&lt;br /&gt;
   # ./autogen.sh&lt;br /&gt;
   # ./configure CC=clang CXX=clang++&lt;br /&gt;
&lt;br /&gt;
Edit the generated Makefile, change the recipe for the test_hw target such that your three virtual disks are listed in the DISKS environment variable.&lt;br /&gt;
&lt;br /&gt;
   test_hw: test_verify test/zfs-tests/cmd&lt;br /&gt;
           @KEEP=&amp;quot;`zpool list -H -oname`&amp;quot; \&lt;br /&gt;
            STF_TOOLS=$(abs_top_srcdir)/test/test-runner/stf \&lt;br /&gt;
            STF_SUITE=$(abs_top_srcdir)/test/zfs-tests \&lt;br /&gt;
            DISKS=&amp;quot;/dev/disk3 /dev/disk1 /dev/disk2&amp;quot; \&lt;br /&gt;
            su zfs-tests -c &amp;quot;ksh $(abs_top_srcdir)/test/zfs-tests/cmd/scripts/zfstest.ksh $$RUNFILE&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Run the test suite&lt;br /&gt;
&lt;br /&gt;
   sudo make test_hw&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
The test suite writes summary pass/fail information to the console as it runs. On completion of the run, summary statistics are written to the console.&lt;br /&gt;
&lt;br /&gt;
Test log files are stored in /var/tmp/&amp;lt;testrun&amp;gt; (where &amp;lt;testrun&amp;gt; is a unique number). In that directory there is a log file, plus a directory per test containing detailed log information for that specific test.&lt;br /&gt;
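&lt;br /&gt;
A hypothetical helper for locating the newest run directory (the function name is made up, and the layout described above is assumed):&lt;br /&gt;

```shell
# Hypothetical helper: print the newest test-run directory under the given
# base directory (default /var/tmp). Assumes each run is a subdirectory,
# as described above; newest is chosen by directory modification time.
latest_run() {
    ls -dt "${1:-/var/tmp}"/*/ | head -n 1
}
```

For example, grep -E 'PASS|FAIL' "$(latest_run)log" shows the summary lines of the most recent run.&lt;br /&gt;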
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Quick peek at how HFS+ and ZFS compare, just to see how much we should improve ZFS by.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, these are not ideal testing conditions, but they should serve as an indicator.&lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section attempts to outline how ZFS on OS X differs from ZFS on other platforms, to assist developers new to the Apple platform who wish to contribute to, or understand, development of the O3X version.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that is rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer holding the ZFS znode), '''vfsops''' (list of filesystem calls, &amp;quot;vnops&amp;quot;) and so on is '''in the call to vnode_create()'''. &lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, can not be done. &lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode which will eventually call vnode_create, so we can not do anything with dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create() decides to reclaim, it will do so directly, on the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls, we can&lt;br /&gt;
use the API call vnode_isrecycled() to detect whether the vnop was called &amp;quot;the normal way&amp;quot; or from vclean(). If we come from vclean() and the vnode is doomed, we do as little as possible: we cannot open a new TX, and&lt;br /&gt;
we cannot take mutex locks (panic: locking against ourselves).&lt;br /&gt;
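&lt;br /&gt;
The resulting pattern in those vnops looks roughly like this (an illustrative sketch only, not the exact O3X code):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;c&amp;quot;&amp;gt;&lt;br /&gt;
static int&lt;br /&gt;
zfs_vnop_fsync(struct vnop_fsync_args *ap)&lt;br /&gt;
{&lt;br /&gt;
    /* Called from vclean() on a doomed vnode: do as little as&lt;br /&gt;
     * possible - no new dmu_tx, no mutex locks. */&lt;br /&gt;
    if (vnode_isrecycled(ap-&amp;gt;a_vp))&lt;br /&gt;
        return (0);&lt;br /&gt;
&lt;br /&gt;
    /* ... the normal fsync path ... */&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;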
&lt;br /&gt;
Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you run into this lovely XNU code&lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
in vfs_subr.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment, there is some extra logic in '''zfs_vnop_reclaim''' to handle the case where we are re-entered on the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are B_FALSE, we can reclaim directly right there, as no final dmu_tx is triggered in those cases. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to B_TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is B_TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). Running as a separate thread, it can safely open a&lt;br /&gt;
dmu_tx.&lt;br /&gt;
&lt;br /&gt;
If fastpath is B_TRUE, we do nothing further in zfs_vnop_reclaim. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that Illumos has a delete fastpath. In zfs_remove, if it is detected that the znode can be deleted now (&amp;quot;delete_now&amp;quot;), it marks the vnode as free and directly calls zfs_znode_delete(); if it cannot, it adds the znode with zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; XNU always retains full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, but remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns 1 do we know that vnop_reclaim was called, and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code to do nothing with the vnode (zp-&amp;gt;z_fastpath) except clear out z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a little special lock handling in zfs_zinactive, since it can be called from inside vnode_create(), which ZFS calls with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to initiate a mount from inside the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we cannot call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter its probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply in the affirmative. However, automounting is disabled here, as there is no way to specify a mountpoint with it;&lt;br /&gt;
zfs.util instead calls DADiskMount to mount the snapshot at the correct directory.&lt;br /&gt;
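&lt;br /&gt;
The userland side in zfs.util uses the DiskArbitration framework; roughly like this (an illustrative sketch - the BSD name, snapshot path and callback name are examples, and the real code carries more error handling):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;c&amp;quot;&amp;gt;&lt;br /&gt;
DASessionRef session = DASessionCreate(kCFAllocatorDefault);&lt;br /&gt;
/* the fake/virtual /dev/diskX entry created for the snapshot */&lt;br /&gt;
DADiskRef disk = DADiskCreateFromBSDName(kCFAllocatorDefault, session,&lt;br /&gt;
    &amp;quot;disk2s1&amp;quot;);&lt;br /&gt;
CFURLRef where = CFURLCreateWithFileSystemPath(kCFAllocatorDefault,&lt;br /&gt;
    CFSTR(&amp;quot;/tank/.zfs/snapshot/mysnap&amp;quot;), kCFURLPOSIXPathStyle, true);&lt;br /&gt;
/* mount_done_cb is a hypothetical completion callback */&lt;br /&gt;
DADiskMount(disk, where, kDADiskMountOptionDefault, mount_done_cb, NULL);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;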
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount succeed. The first getattr triggers the mount attempt; the DADiskMount call then causes getattr to be called,&lt;br /&gt;
and we have to pretend that said entry already exists.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two vn_rdwr() variants in the O3X SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs send/recv) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (i.e. the calls to '''getf''' and '''releasef''') needs to place the internal XNU '''struct fileproc''' in the wrapper '''struct spl_fileproc''', so that '''spl_vn_rdwr()''' can use it to do I/O.&lt;br /&gt;
This is the only way to do I/O on a non-file vnode (i.e. a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
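&lt;br /&gt;
Conceptually, a zfs_onexit consumer does something like the following (a simplified sketch; the argument order mirrors the Illumos-style vn_rdwr() and is illustrative, not the exact SPL signature):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;c&amp;quot;&amp;gt;&lt;br /&gt;
/* getf() stashes XNU's struct fileproc inside a struct spl_fileproc */&lt;br /&gt;
file_t *fp = getf(fd);&lt;br /&gt;
if (fp == NULL)&lt;br /&gt;
    return (EBADF);&lt;br /&gt;
/* spl_vn_rdwr() uses the stashed fileproc, so this works even when&lt;br /&gt;
 * fd refers to a pipe or socket rather than a regular file: */&lt;br /&gt;
error = spl_vn_rdwr(UIO_WRITE, fp-&amp;gt;f_vnode, (caddr_t)buf, len, off,&lt;br /&gt;
    UIO_SYSSPACE, 0, RLIM64_INFINITY, CRED(), &amp;amp;resid);&lt;br /&gt;
releasef(fd);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;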
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole set of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name,&lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-21T22:34:57Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: /* Run Test Suite */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or, use another terminal&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another symbol file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation, which used the zones allocator; we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 in the boot arguments; then we can use&lt;br /&gt;
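&lt;br /&gt;
For example, combining it with the boot arguments used in the debugging section above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144 zlog=kalloc.256&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;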
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
and indeed, list any index&lt;br /&gt;
&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be significant filesystem performance improvement.&lt;br /&gt;
&lt;br /&gt;
Selecting an allocator has been very challenging on OS X. In the last year we have evolved through:&lt;br /&gt;
* Direct calls to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k).&lt;br /&gt;
* Direct calls to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because the result of exceeding this limit is a kernel panic, with no other feedback mechanism available.&lt;br /&gt;
* Direct calls to bmalloc - a home-grown slice allocator that allocated slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space-inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocated to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of the kmem and vmem allocators using code from Illumos, plus a memory-pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0, the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours in the allocator - efficient lifecycle control of objects and an effective response to memory pressure in the machine - and the allocator was not very space-efficient because of metadata overheads in bmalloc. We were also not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the kmem code from Illumos on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default, kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We tuned vmem to use 512 KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper (kmem and vmem implement this design): https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum, spl-kmem.c and spl-osx.c need to see this define for the debugging features to be completely enabled.&lt;br /&gt;
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect not to panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c, set kmem_panic=0 to log only, or kmem_panic=1 to log and panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
   {&lt;br /&gt;
       ...&lt;br /&gt;
       int *p;&lt;br /&gt;
       for (int i = 0; i &amp;lt; 20; i++) {&lt;br /&gt;
           p = (int *)zfs_kmem_alloc(1024, KM_SLEEP);&lt;br /&gt;
           zfs_kmem_free(p, 1024);&lt;br /&gt;
           *p = 0;   /* modify after free */&lt;br /&gt;
       }&lt;br /&gt;
   }&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see the kind of memory corruption, the actual corrupted data, which kmem cache was involved, the relative time at which the last action occurred, and the stack trace for that action (which was a call to zfs_kmem_free()) - indicating that spl_start() was implicated in the fault. This event would have been logged on the next allocation after the free-and-modify occurred.&lt;br /&gt;
&lt;br /&gt;
=== Compiling for older OS X versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version - in this case, compiling for 10.9 on a 10.10 system:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html BrendanGregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Unit Test ==&lt;br /&gt;
&lt;br /&gt;
We have created an initial port of the standard ZFS test suite. It consists of a collection of scripts and miscellaneous utility programs that exercise the complete breadth and depth of the ZFS filesystem.&lt;br /&gt;
&lt;br /&gt;
The tests are best run in a virtual machine with a baseline configuration that has been captured in a snapshot. The tests should be run on the VM and then, due to their destructive nature, the VM should be reverted to the snapshot in preparation for future test runs. The tests take 2-4 hours to run, depending on hardware.&lt;br /&gt;
&lt;br /&gt;
=== Setup ===&lt;br /&gt;
&lt;br /&gt;
The user zfs-tests needs to be able to run sudo without being asked for a password. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   zfs-tests ALL=(ALL) NOPASSWD: ALL&lt;br /&gt;
&lt;br /&gt;
The sudo root environment must be configured to pass certain environment variables from zfs-tests through to the root environment. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   Defaults env_keep += &amp;quot;__ZFS_MAIN_MOUNTPOINT_DIR&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Modify /etc/bashrc to contain&lt;br /&gt;
&lt;br /&gt;
   export __ZFS_MAIN_MOUNTPOINT_DIR=&amp;quot;/&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If your development directory is ~you/Developer, clone zfs-test, zfs and spl into that directory:&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs-test.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/spl.git&lt;br /&gt;
&lt;br /&gt;
Build ZFS using the building-from-source instructions.&lt;br /&gt;
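&lt;br /&gt;
In outline this is the usual autotools dance, with SPL built before ZFS (a rough sketch; see the building-from-source page for the authoritative steps and configure options):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ cd ~you/Developer/spl &amp;amp;&amp;amp; ./autogen.sh &amp;amp;&amp;amp; ./configure &amp;amp;&amp;amp; make&lt;br /&gt;
$ cd ~you/Developer/zfs &amp;amp;&amp;amp; ./autogen.sh &amp;amp;&amp;amp; ./configure &amp;amp;&amp;amp; make&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;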
&lt;br /&gt;
Ensure that /var/tmp has approximately 100GB of free space.&lt;br /&gt;
&lt;br /&gt;
Create three virtual hard drives of 10-20GB capacity each.&lt;br /&gt;
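&lt;br /&gt;
One way to create and attach them with hdiutil (the file names are just examples):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil create -size 20g -type SPARSE -layout NONE /var/tmp/zfstest1&lt;br /&gt;
$ hdiutil attach -nomount /var/tmp/zfstest1.sparseimage&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Repeat for zfstest2 and zfstest3; hdiutil attach prints the /dev/diskX entry to list in the DISKS variable below.&lt;br /&gt;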
&lt;br /&gt;
=== Run Test Suite ===&lt;br /&gt;
&lt;br /&gt;
Set up the tests:&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer/zfs-test&lt;br /&gt;
   # ./autogen.sh&lt;br /&gt;
   # ./configure CC=clang CXX=clang++&lt;br /&gt;
&lt;br /&gt;
Edit the generated Makefile, changing the recipe for the test_hw target so that your three virtual disks are listed in the DISKS environment variable.&lt;br /&gt;
&lt;br /&gt;
   test_hw: test_verify test/zfs-tests/cmd&lt;br /&gt;
           @KEEP=&amp;quot;`zpool list -H -oname`&amp;quot; \&lt;br /&gt;
            STF_TOOLS=$(abs_top_srcdir)/test/test-runner/stf \&lt;br /&gt;
            STF_SUITE=$(abs_top_srcdir)/test/zfs-tests \&lt;br /&gt;
            DISKS=&amp;quot;/dev/disk3 /dev/disk1 /dev/disk2&amp;quot; \&lt;br /&gt;
            su zfs-tests -c &amp;quot;ksh $(abs_top_srcdir)/test/zfs-tests/cmd/scripts/zfstest.ksh $$RUNFILE&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Run the test suite&lt;br /&gt;
&lt;br /&gt;
   sudo make test_hw&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
The test suite writes summary pass/fail information to the console as it runs. On completion of the run, summary statistics are written to the console.&lt;br /&gt;
&lt;br /&gt;
Test log files are stored in /var/tmp/&amp;lt;testrun&amp;gt; (where &amp;lt;testrun&amp;gt; is a unique run number). In that directory there is a summary log file, plus a directory per test; each test directory contains detailed log information for that specific test.&lt;br /&gt;
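&lt;br /&gt;
The newest run directory can be found by sorting on modification time. A small sketch, run here against a synthetic mock of /var/tmp (the directory names and the summary file name log are assumptions following the layout described above; point base at /var/tmp for a real run):&lt;br /&gt;

```shell
# Mock of /var/tmp after two test runs; for a real run set base=/var/tmp.
base=/tmp/o3x-demo-runs
rm -rf $base
mkdir -p $base/run-old $base/run-new
echo Results Summary: PASS 123 FAIL 0 | tee $base/run-new/log
# Backdate the older run so the sort below is deterministic.
touch -t 201508150510 $base/run-old
# Newest run directory first; its log file holds the summary.
latest=$(ls -td $base/*/ | head -1)
cat ${latest}log
```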
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick look at how the two file systems compare, to see roughly how much we need to improve.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, these are not ideal testing conditions, but they should serve as an indicator. &lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section outlines how the OS X port differs from ZFS on other platforms, to assist developers who are new to the Apple platform and who wish to contribute to, or understand, development of the O3X version.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that is rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls &amp;quot;vnops&amp;quot;), etc., is in the call to '''vnode_create()'''. &lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, cannot be used. &lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create(), so we cannot do anything with the dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create() decides to reclaim, it will do so directly, on the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls, we can &lt;br /&gt;
use the API call vnode_isrecycled() to detect whether the vnop was called &amp;quot;the normal way&amp;quot; or from vclean(). If we come from vclean() and the vnode is doomed, we do as little as possible. We cannot open a new TX, and&lt;br /&gt;
we cannot use mutex locks (panic: locking against ourselves).&lt;br /&gt;
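&lt;br /&gt;
For example, the guard at the top of such a vnop might look like this (an illustrative sketch only; the real O3X handlers carry more state):&lt;br /&gt;
&lt;br /&gt;
    if (vnode_isrecycled(ap-&amp;gt;a_vp)) {&lt;br /&gt;
        /* Called via vclean() on a doomed vnode: do as little&lt;br /&gt;
           as possible - no new dmu_tx, no mutex locks. */&lt;br /&gt;
        return (0);&lt;br /&gt;
    }&lt;br /&gt;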
&lt;br /&gt;
Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you run into the lovely XNU code in vfs_subr.c: &lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle that we might be re-entrant as the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are FALSE, we can call direct reclaim right there, since in those cases no final dmu_tx is caused. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). As a separate thread, it is able to open a&lt;br /&gt;
dmu_tx. &lt;br /&gt;
&lt;br /&gt;
If fastpath is TRUE, we do nothing further in zfs_vnop_reclaim. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that IllumOS has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it cannot, it adds the znode with zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; i.e., XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, and remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code to not do anything with the vnode (zp-&amp;gt;z_fastpath) but to only clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a little special lock-handling in zfs_zinactive, since we can call it from inside of a vnode_create() which is called by ZFS with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to cause a mount from inside the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we cannot call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot.  '''diskarbitrationd''' will wake up due to the new disk and enter its probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply in the affirmative. However, automount is disabled here, as there is no way to specify a mountpoint with auto;&lt;br /&gt;
zfs.util instead calls DADiskMount to mount it at the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount succeed. The first getattr triggers the mount attempt; the DADiskMount call then causes getattr to be called,&lt;br /&gt;
and we have to pretend that said entry already exists.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OS X's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs recv/send) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in calls to '''getf''' and '''releasef''') needs to place the internal XNU '''struct fileproc''' in the wrapper '''struct spl_fileproc''', so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e., a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name, &lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-15T05:10:11Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: added unit test outline&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or, use another terminal&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation that used the zones allocator - we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 on the command line; then we can use:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
and indeed, we can list any index:&lt;br /&gt;
&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be significant filesystem performance improvement.&lt;br /&gt;
&lt;br /&gt;
Selection of an allocator has been very challenging on OS X. In the last year we have evolved from:&lt;br /&gt;
* Direct call to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k)&lt;br /&gt;
* Direct call to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because the result of exceeding this limit is a kernel panic with no other feedback mechanism available.&lt;br /&gt;
* Direct call to bmalloc - bmalloc was a home-grown slice allocator that allocated slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocated to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of kmem and vmem allocators using code from Illumos, plus provision of a memory-pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0 the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours in the allocator - efficient lifecycle control of objects and an effective response to memory pressure in the machine - and the allocator was not very space efficient because of metadata overheads in bmalloc. We were also not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the kmem code from Illumos on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We tuned vmem to use 512KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.  &lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper, whose design kmem and vmem implement: https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum spl-kmem.c and spl-osx.c need to see this define for the debugging features to be completely enabled.&lt;br /&gt;
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect to not panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c set kmem_panic=0 to log, kmem_panic=1 to log+panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
   { &lt;br /&gt;
       ...&lt;br /&gt;
       int *p;&lt;br /&gt;
       for(int i=0; i&amp;lt;20;i++) {&lt;br /&gt;
          p = (int*)spl_kmem_alloc(1024);&lt;br /&gt;
          spl_kmem_free(p);&lt;br /&gt;
          *p = 0;   /* deliberate modify-after-free */&lt;br /&gt;
       }&lt;br /&gt;
   }&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see the kind of memory corruption, the actual corrupted data, which kmem cache was involved, the relative time at which the last action occurred, and the stack trace for the last action (which was a call to zfs_kmem_free()) - indicating that spl_start() was implicated in the fault. This event would have been logged on the next allocation after the free-and-modify occurred.&lt;br /&gt;
&lt;br /&gt;
=== Compiling for older OS X versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version (in this case, compiling for 10.9 on a 10.10 system), point configure at the matching SDK and set the deployment target:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
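&lt;br /&gt;
As a sanity check, it helps to confirm the SDK headers exist before configuring. A minimal sketch (the helper name and the Xcode location are assumptions; adjust for your installation):&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: build the kernel-headers path for a given SDK
# version, so the same long string is not retyped for SPL and ZFS.
sdk_kernel_headers() {
    ver="$1"
    echo "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX${ver}.sdk/System/Library/Frameworks/Kernel.framework/"
}

hdrs="$(sdk_kernel_headers 10.9)"

# Warn early if the SDK is not installed; the -d test only means
# something on a machine that actually has Xcode present.
[ -d "$hdrs" ] || echo "warning: SDK headers not found at $hdrs"

# One would then run, e.g.:
#   ./configure --with-kernel-headers="$hdrs" CFLAGS=-mmacosx-version-min=10.9
```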
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html Brendan Gregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will sample kernel stacks at 997 Hz and run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Unit Test ==&lt;br /&gt;
&lt;br /&gt;
We have created an initial port of the standard ZFS test suite. It consists of a collection of scripts and miscellaneous utility programs, and exercises the complete breadth and depth of the ZFS filesystem. &lt;br /&gt;
&lt;br /&gt;
The tests are best run in a virtual machine with a baseline configuration captured in a snapshot. The tests should be run on the VM and then, due to their destructive nature, the VM should be reverted to the snapshot in preparation for future test runs. The tests take 2-4 hours to run depending on the hardware setup.&lt;br /&gt;
&lt;br /&gt;
=== Setup ===&lt;br /&gt;
&lt;br /&gt;
The user zfs-tests needs to be able to run sudo without issuing a password. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   zfs-tests ALL=(ALL) NOPASSWD: ALL&lt;br /&gt;
&lt;br /&gt;
The sudo root environment must be configured to pass certain environment variables from zfs-tests through to the root environment. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   Defaults env_keep += &amp;quot;__ZFS_MAIN_MOUNTPOINT_DIR&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Modify /etc/bashrc to contain&lt;br /&gt;
&lt;br /&gt;
   export __ZFS_MAIN_MOUNTPOINT_DIR=&amp;quot;/&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If your development directory is ~you/Developer, clone zfs, spl and zfs-test into that directory&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs-test.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/spl.git&lt;br /&gt;
&lt;br /&gt;
Build ZFS and SPL using the building from source instructions.&lt;br /&gt;
&lt;br /&gt;
Ensure that /var/tmp has approximately 100GB of free space.&lt;br /&gt;
&lt;br /&gt;
Create three virtual hard drives of 10-20GB capacity each.&lt;br /&gt;
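&lt;br /&gt;
One way to create the backing files is with dd and hdiutil, as in the file-based zpool section below. A sketch (sizes here are tiny for illustration; use 10-20GB for real runs, and note that hdiutil exists only on OS X):&lt;br /&gt;
&lt;br /&gt;
```shell
# Create three small file-backed disks with dd (bs=1048576 is 1MB,
# spelled out so it works with both BSD and GNU dd).
for n in 1 2 3; do
    dd if=/dev/zero bs=1048576 count=4 of="vdisk$n" 2>/dev/null
done

# Attach them as raw disk images; guarded because hdiutil is OS X only.
if command -v hdiutil >/dev/null 2>&1; then
    for n in 1 2 3; do
        hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount "vdisk$n"
    done
fi
```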
&lt;br /&gt;
=== Run Test Suite ===&lt;br /&gt;
&lt;br /&gt;
Set up the tests to run&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer/zfs-test&lt;br /&gt;
   # ./autogen.sh&lt;br /&gt;
   # ./configure CC=clang CXX=clang++&lt;br /&gt;
&lt;br /&gt;
Edit the generated Makefile and change the recipe for the test_hw target so that your three virtual disks are listed in the DISKS environment variable.&lt;br /&gt;
&lt;br /&gt;
   test_hw: test_verify test/zfs-tests/cmd&lt;br /&gt;
           @KEEP=&amp;quot;`zpool list -H -oname`&amp;quot; \&lt;br /&gt;
            STF_TOOLS=$(abs_top_srcdir)/test/test-runner/stf \&lt;br /&gt;
            STF_SUITE=$(abs_top_srcdir)/test/zfs-tests \&lt;br /&gt;
            DISKS=&amp;quot;/dev/disk3 /dev/disk1 /dev/disk2&amp;quot; \&lt;br /&gt;
            su zfs-tests -c &amp;quot;ksh $(abs_top_srcdir)/test/zfs-tests/cmd/scripts/zfstest.ksh $$RUNFILE&amp;quot;&lt;br /&gt;
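&lt;br /&gt;
The device numbers assigned to your virtual disks vary between boots, so check them first (diskutil list on OS X). A small sketch of assembling the DISKS value (the helper name is an assumption, purely illustrative):&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: turn disk identifiers into the space-separated
# DISKS value expected by the test_hw recipe.
make_disks() {
    out=""
    for d in "$@"; do
        out="$out /dev/$d"
    done
    echo "${out# }"    # strip the leading space
}

make_disks disk3 disk1 disk2    # prints /dev/disk3 /dev/disk1 /dev/disk2
```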
&lt;br /&gt;
Run the test suite&lt;br /&gt;
&lt;br /&gt;
   sudo make test&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
The test suite writes summary pass/fail information to the console as it runs. On completion of the run, summary statistics are written to the console.&lt;br /&gt;
&lt;br /&gt;
Test log files are stored in /var/tmp/&amp;lt;testrun&amp;gt; (where &amp;lt;testrun&amp;gt; is a unique-looking number). In that directory there is a log file, and a directory per test. Within each test directory is detailed log information regarding the specific test.&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick look at how they compare, just to see how much we should improve by.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, this is not an ideal test setup, but it should serve as an indicator. &lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run in standard auto mode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
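&lt;br /&gt;
For example, a quick smoke test of dataset features on the pool. A sketch only, guarded so it is a no-op on a machine without ZFS or passwordless sudo:&lt;br /&gt;
&lt;br /&gt;
```shell
# Exercise a few dataset features on the file-backed pool "tank".
# Skipped entirely when the zfs tool or non-interactive sudo is missing.
if command -v zfs >/dev/null 2>&1 && sudo -n true 2>/dev/null; then
    sudo -n zfs create tank/fs1
    sudo -n zfs snapshot tank/fs1@before
    sudo -n zfs rollback tank/fs1@before
    sudo -n zfs destroy -r tank/fs1
    status=done
else
    status=skipped
fi
echo "$status"
```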
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section attempts to outline the differences between the ZFS versions on other platforms and the OS X port, to assist developers who are new to the Apple platform and wish to contribute to, or understand, development of the O3X version.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that is rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls &amp;quot;vnops&amp;quot;), etc., is '''in the call to vnode_create()'''. &lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, can not be done. &lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode which will eventually call vnode_create, so we can not do anything with dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it will do so directly, in the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. For the first three of these calls, we can &lt;br /&gt;
use the API call vnode_isrecycled() to detect whether these vnops are called &amp;quot;the normal way&amp;quot; or from vclean. If we come from vclean, and the vnode is doomed, we do as little as possible. We can not open a new TX, and&lt;br /&gt;
we can not use mutex locks (panic: locking against ourselves).&lt;br /&gt;
&lt;br /&gt;
Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you find the lovely XNU code of &lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
in vfs_subr.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle that we might be re-entrant as the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are FALSE, we can call direct reclaim right there, as in those cases no final dmu_tx is caused. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). As a separate thread it can handle calling&lt;br /&gt;
dmu_tx. &lt;br /&gt;
&lt;br /&gt;
If fastpath is TRUE, we do nothing further in zfs_vnop_reclaim. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that IllumOS has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it can not, it adds the znode via zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode. I.e., XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can just be marked for later reclaim, but remain active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called, and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code to not do anything with the vnode (zp-&amp;gt;z_fastpath) but to only clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a little special lock handling in zfs_zinactive, since we can call it from inside a vnode_create(), which ZFS calls with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to cause a mount in the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we can not call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot.  '''diskarbitrationd''' will wake up due to new disk, it will enter the probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply affirmatively. However, automount is disabled here, as there is no way to specify a mountpoint with auto.&lt;br /&gt;
zfs.util will call DADiskMount to mount it to the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount successful. The first getattr will cause the mount attempt; the DADiskMount call will cause getattr to be called,&lt;br /&gt;
and we have to pretend to have said entry.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OSX's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs recv/send) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in calls to '''getf''' and '''releasef''') needs to place the internal XNU '''struct fileproc''' in the wrapper '''struct spl_fileproc''', so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e., a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name, &lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-15T05:09:33Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: /* Unit Test */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or use another terminal&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation that used the zones allocator; we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 on the command line; then we can use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And indeed, we can list any index:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be significant filesystem performance improvement.&lt;br /&gt;
&lt;br /&gt;
Selection of an allocator has been very challenging on OS X. In the last year we have evolved from:&lt;br /&gt;
* Direct call to OSMalloc - a very low level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k)&lt;br /&gt;
* Direct call to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because the result of exceeding this limit is a kernel panic with no other feedback mechanism available.&lt;br /&gt;
* Direct call to bmalloc - bmalloc was a home-grown slice allocator that allocated slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocation to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of kmem and vmem allocators using code from Illumos, plus provision of a memory pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long since provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0 the kmem implementation has been a stub that passes allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours in the allocator - efficient lifecycle control of objects, and an effective response to memory pressure in the machine, and the allocator was not very space efficient because of metadata overheads in bmalloc. We were also not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the kmem code from Illumos on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We therefore tuned vmem to take 512 KiB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
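The trade-off described above can be illustrated with a toy arena (invented names, not the real vmem code): a chunk can go back to the page allocator only once every allocation carved from it has been freed, so larger import sizes mean memory is returned less smoothly under pressure.&lt;br /&gt;

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: a 512 KiB imported chunk carved into 4 KiB quanta. */
enum { CHUNK_BYTES = 512 * 1024, QUANTUM = 4096,
	SLOTS = CHUNK_BYTES / QUANTUM };

typedef struct chunk {
	void *base;
	int live;               /* outstanding allocations in this chunk */
} chunk_t;

static int chunks_released;

chunk_t *chunk_import(void)
{
	chunk_t *c = malloc(sizeof (*c));
	c->base = malloc(CHUNK_BYTES);   /* simulate the page allocator */
	c->live = 0;
	return (c);
}

void *chunk_alloc(chunk_t *c, int slot)
{
	c->live++;
	return ((char *)c->base + (size_t)slot * QUANTUM);
}

void chunk_free(chunk_t *c, void *p)
{
	(void) p;
	if (--c->live == 0) {            /* whole chunk idle: release it */
		free(c->base);
		free(c);
		chunks_released++;
	}
}
```

With a 512 KiB chunk, 127 of its 128 quanta can be free and the chunk still cannot be handed back; smaller import sizes would release memory sooner, at the cost of hammering the page allocator.&lt;br /&gt;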
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper - kmem and vmem implement this design: https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum, spl-kmem.c and spl-osx.c need to see this define for the debugging features to be fully enabled.&lt;br /&gt;
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log: in spl-kmem.c set kmem_panic=1 to log and panic, or kmem_panic=0 to log only. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect not to panic, the program will continue to run despite the memory corruption, with undefined consequences.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
   {&lt;br /&gt;
       ...&lt;br /&gt;
       int *p;&lt;br /&gt;
       for (int i = 0; i &amp;lt; 20; i++) {&lt;br /&gt;
          p = (int *)spl_kmem_alloc(1024);&lt;br /&gt;
          spl_kmem_free(p);&lt;br /&gt;
          *p = 0;   /* deliberate modify-after-free for kmem to detect */&lt;br /&gt;
       }&lt;br /&gt;
   }&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see the kind of memory corruption, the actual corrupted data, which kmem cache was involved, the relative time of the last action, and the stack trace for the last action (which was a call to zfs_kmem_free()), indicating that spl_start() was implicated in the fault. This event would have been logged on the next allocation after the free and modify occurred.&lt;br /&gt;
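The detection mechanism behind the &amp;quot;buffer modified after being freed&amp;quot; message can be sketched roughly like this (an illustration of the general technique, not the actual kmem code): on free, the buffer is filled with a known pattern, and the pattern is verified on the next allocation from the cache, which also yields the offset of the corruption.&lt;br /&gt;

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* The 0xdeadbeef... pattern seen in the log above. */
#define FREE_PATTERN 0xdeadbeefdeadbeefULL

/* Stamp a freed buffer with the pattern. */
void fill_free_pattern(uint64_t *buf, size_t words)
{
	for (size_t i = 0; i < words; i++)
		buf[i] = FREE_PATTERN;
}

/*
 * Verify the pattern on the next allocation; on failure report the
 * byte offset of the modification (as in the log message above).
 */
int verify_free_pattern(const uint64_t *buf, size_t words, size_t *bad_off)
{
	for (size_t i = 0; i < words; i++) {
		if (buf[i] != FREE_PATTERN) {
			*bad_off = i * sizeof (uint64_t);
			return (0);   /* buffer modified after being freed */
		}
	}
	return (1);
}
```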
&lt;br /&gt;
=== Compiling for older OS X versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version - in this case, compiling for 10.9 on a 10.10 system:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html BrendanGregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
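For reference, the folding step performed by stackcollapse.pl amounts to joining each sampled stack's frames (root first) with &amp;quot;;&amp;quot; and appending the sample count, producing one line per unique stack for flamegraph.pl to consume. A rough C sketch of that transformation for a single sample:&lt;br /&gt;

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Join frames with ';' and append the sample count, e.g.
 * {"a", "b"}, 42  ->  "a;b 42" (the "folded" flamegraph format).
 */
void fold_stack(const char **frames, int n, int count,
    char *out, size_t outlen)
{
	out[0] = '\0';
	for (int i = 0; i < n; i++) {
		strncat(out, frames[i], outlen - strlen(out) - 1);
		if (i != n - 1)
			strncat(out, ";", outlen - strlen(out) - 1);
	}
	char buf[32];
	snprintf(buf, sizeof (buf), " %d", count);
	strncat(out, buf, outlen - strlen(out) - 1);
}
```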
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Unit Test ==&lt;br /&gt;
&lt;br /&gt;
We have created an initial port of the standard ZFS test suite. It consists of a collection of scripts and miscellaneous utility programs that exercise the complete breadth and depth of the ZFS filesystem.&lt;br /&gt;
&lt;br /&gt;
The tests are best run in a virtual machine, with a baseline configured setup captured in a snapshot. Run the tests on the VM and then, due to the destructive nature of the tests, revert the VM to the snapshot in preparation for future test runs. The tests take 2-4 hours to run depending on the hardware setup.&lt;br /&gt;
&lt;br /&gt;
=== Setup ===&lt;br /&gt;
&lt;br /&gt;
The user zfs-tests needs to be able to run sudo without entering a password. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   zfs-tests ALL=(ALL) NOPASSWD: ALL&lt;br /&gt;
&lt;br /&gt;
The sudo root environment must be configured to pass certain environment variables from zfs-tests through to the root environment. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   Defaults env_keep += &amp;quot;__ZFS_MAIN_MOUNTPOINT_DIR&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Modify /etc/bashrc to contain&lt;br /&gt;
&lt;br /&gt;
   export __ZFS_MAIN_MOUNTPOINT_DIR=&amp;quot;/&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If your development directory is ~you/Developer, clone the zfs, spl and zfs-test repositories into that directory:&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs-test.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/spl.git&lt;br /&gt;
&lt;br /&gt;
Build ZFS using the building from source instructions.&lt;br /&gt;
&lt;br /&gt;
Ensure that /var/tmp has approximately 100GB of free space.&lt;br /&gt;
&lt;br /&gt;
Create three virtual hard drives of 10-20GB capacity each.&lt;br /&gt;
&lt;br /&gt;
=== Run Test Suite ===&lt;br /&gt;
&lt;br /&gt;
Set up the tests to run:&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer/zfs-tests&lt;br /&gt;
   # ./autogen.sh&lt;br /&gt;
   # ./configure CC=clang CXX=clang++&lt;br /&gt;
&lt;br /&gt;
Edit the generated Makefile, changing the recipe for the test_hw target so that your three virtual disks are listed in the DISKS environment variable.&lt;br /&gt;
&lt;br /&gt;
   test_hw: test_verify test/zfs-tests/cmd&lt;br /&gt;
           @KEEP=&amp;quot;`zpool list -H -oname`&amp;quot; \&lt;br /&gt;
            STF_TOOLS=$(abs_top_srcdir)/test/test-runner/stf \&lt;br /&gt;
            STF_SUITE=$(abs_top_srcdir)/test/zfs-tests \&lt;br /&gt;
            DISKS=&amp;quot;/dev/disk3 /dev/disk1 /dev/disk2&amp;quot; \&lt;br /&gt;
            su zfs-tests -c &amp;quot;ksh $(abs_top_srcdir)/test/zfs-tests/cmd/scripts/zfstest.ksh $$RUNFILE&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Run the test suite&lt;br /&gt;
&lt;br /&gt;
   sudo make test&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
The test suite writes summary pass/fail information to the console as it runs. On completion of the test run, summary statistics are written to the console.&lt;br /&gt;
&lt;br /&gt;
Test log files are stored in /var/tmp/&amp;lt;testrun&amp;gt; (where &amp;lt;testrun&amp;gt; is a unique-looking number). In that directory there is a log file, plus a directory per test containing detailed log information for that specific test.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick peek at how they compare, just to see how much we should improve by.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, this is not an ideal test setup, but it should serve as an indicator.&lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run in standard auto mode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section attempts to outline the differences between the ZFS versions on other platforms and OS X, to assist developers new to the Apple platform who wish to contribute to, or understand, development of the O3X version.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First, it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where the vnode changes, but that would be rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls, &amp;quot;vnops&amp;quot;) and so on is '''in the call to vnode_create()'''.&lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;; the FreeBSD method of pre-allocating vnodes to avoid reclaim can not be used.&lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create, so we can not do anything with dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it will do so directly, on the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls we can&lt;br /&gt;
use the API call vnode_isrecycled() to detect whether the vnop was called &amp;quot;the normal way&amp;quot; or from vclean. If we come from vclean, and the vnode is doomed, we do as little as possible: we can not open a new TX, and&lt;br /&gt;
we can not use mutex locks (panic: locking against ourselves).&lt;br /&gt;
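A user-space sketch of that guard, with hypothetical stubs standing in for the opaque vnode and the vnode_isrecycled() KPI (the real O3X vnops differ):&lt;br /&gt;

```c
#include <assert.h>

/* Stand-ins: the real vnode is opaque and vnode_isrecycled() is an
 * XNU KPI; these stubs exist only so the sketch runs in user space. */
struct vnode {
	int doomed;
};

static int vnode_isrecycled(struct vnode *vp)
{
	return (vp->doomed);
}

static int work_done;

int zfs_vnop_fsync_sketch(struct vnode *vp)
{
	/*
	 * If we got here from vclean() on a doomed vnode, we may be on
	 * the vnode_create() thread: no new TX, no mutex locks, do the
	 * bare minimum and return.
	 */
	if (vnode_isrecycled(vp))
		return (0);
	work_done = 1;		/* normal fsync path runs here */
	return (0);
}
```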
&lt;br /&gt;
Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you find the lovely XNU code of &lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
in vfs_subr.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle the fact that we might be re-entering as the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are FALSE, we can call direct reclaim right there, as in those cases no final dmu_tx is caused. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). As a separate thread, it can safely open a&lt;br /&gt;
dmu_tx.&lt;br /&gt;
&lt;br /&gt;
If fastpath is TRUE, we do nothing more in zfs_vnop_reclaim. See below.&lt;br /&gt;
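The deferral can be sketched single-threaded as follows (illustrative names; the real reclaim_thread runs concurrently and calls zfs_rmnode()). Work that must not run in the vnode_create() context is queued, and a worker drains the queue from a context where a dmu_tx is allowed:&lt;br /&gt;

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a znode queued for deferred reclaim. */
typedef struct znode_stub {
	struct znode_stub *next;
} znode_stub_t;

static znode_stub_t *reclaim_list;

/* Called from the reclaim vnop: just queue, do no heavy work here. */
void reclaim_defer(znode_stub_t *zp)
{
	zp->next = reclaim_list;
	reclaim_list = zp;
}

/* Drained later by the worker, where opening a tx would be safe. */
int reclaim_drain(void)
{
	int n = 0;
	while (reclaim_list != NULL) {
		znode_stub_t *zp = reclaim_list;
		reclaim_list = zp->next;
		free(zp);	/* stands in for zfs_rmnode(zp) */
		n++;
	}
	return (n);
}
```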
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that Illumos has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it can not, it adds the znode with zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, and remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code to not do anything with the vnode (zp-&amp;gt;z_fastpath) but only clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also some special lock handling in zfs_zinactive, since it can be called from inside a vnode_create(), which ZFS calls with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to cause a mount from within the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we can not call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter the probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply affirmatively. However, automount is disabled here, as there is no way to specify a mountpoint with auto.&lt;br /&gt;
zfs.util then calls DADiskMount to mount it at the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount successful. The first getattr will trigger the mount attempt; the DADiskMount call will cause getattr to be called,&lt;br /&gt;
and we have to pretend to have said entry.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OS X's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs recv/send) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in calls to '''getf''' and '''releasef''') needs to place the internal XNU ''struct fileproc'' in the wrapper ''struct spl_fileproc'', so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e. a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name,&lt;br /&gt;
including for hard link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-15T05:00:23Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
Press &amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or use another terminal.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation that used the zones allocator - we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
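A rough way to rank the suspects - a back-of-the-envelope sketch, not one of the gdb kgmacros - is to multiply each zone's outstanding element count (COUNT, which equals TOT_ALLOC minus TOT_FREE) by its element size (ELT_SZ). The figures below are copied from the zprint output above:&lt;br /&gt;

```shell
# Rank zones by outstanding memory (COUNT * ELT_SZ, in bytes).
# COUNT and ELT_SZ columns copied from the zprint output above.
zones='kalloc.16 1620133 16
kalloc.256 1070398 256
kalloc.512 399302 512
kalloc.1024 100404 1024'
echo "$zones" | awk '{ printf "%-12s %d\n", $1, $2 * $3 }' | sort -k2,2 -rn
```

kalloc.256 tops the list with roughly 274MB outstanding, which is presumably why it is the suspect even though the panic itself fired in kalloc.1024.&lt;br /&gt;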
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 added to the boot arguments; then you can use:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest&lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And indeed, we can list any index:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocations?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
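As a quick arithmetic check of that percentage: 2390 of the 3999 logged allocations passed through arc_buf_alloc:&lt;br /&gt;

```shell
# Integer percentage of logged records that involved arc_buf_alloc.
echo $((2390 * 100 / 3999))
```

which matches the 59% reported by countpcs.&lt;br /&gt;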
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be significant filesystem performance improvement.&lt;br /&gt;
&lt;br /&gt;
Selecting an allocator has been very challenging on OS X. Over the last year we have evolved through:&lt;br /&gt;
* Direct calls to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k).&lt;br /&gt;
* Direct calls to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because exceeding this limit results in a kernel panic with no other feedback mechanism available.&lt;br /&gt;
* Direct calls to bmalloc - a home-grown slice allocator that allocated slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocated to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of the kmem and vmem allocators using code from Illumos, plus a memory-pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X includes the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0 the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours: efficient lifecycle control of objects and an effective response to memory pressure in the machine. The allocator was also not very space efficient because of metadata overheads in bmalloc, and we were not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the Illumos kmem code on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We therefore tuned vmem to request 512KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper (kmem and vmem implement this design): https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum spl-kmem.c and spl-osx.c need to see this define for the debugging features to be completely enabled.&lt;br /&gt;
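One way to arrange that - a sketch only; it assumes the stock autoconf build and that CFLAGS reaches the kext compile, so adapt it to your tree - is to define DEBUG when configuring SPL:&lt;br /&gt;

```shell
# Hypothetical example: define DEBUG for the whole SPL build so that
# spl-kmem.c and spl-osx.c both see it.
./configure CFLAGS="-DDEBUG"
make
```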
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect not to panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c, set kmem_panic=0 to log only, or kmem_panic=1 to log and panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
   {&lt;br /&gt;
       ...&lt;br /&gt;
       int *p;&lt;br /&gt;
       for (int i = 0; i &amp;lt; 20; i++) {&lt;br /&gt;
           p = (int *)spl_kmem_alloc(1024);&lt;br /&gt;
           spl_kmem_free(p);&lt;br /&gt;
           *p = 0;   /* deliberate modify-after-free */&lt;br /&gt;
       }&lt;br /&gt;
   }&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see the kind of memory corruption, the actual corrupted data, which kmem cache was involved, the relative time at which the last action occurred, and the stack trace for the last action (which was a call to zfs_kmem_free()) - indicating that spl_start() was implicated in the fault. This event would have been logged on the next allocation after the free-and-modify occurred.&lt;br /&gt;
&lt;br /&gt;
=== Compiling for older OS X versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version - in this case, compiling for 10.9 on a 10.10 system:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html Brendan Gregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Unit Test ==&lt;br /&gt;
&lt;br /&gt;
We have created an initial port of the standard ZFS test suite. It consists of a collection of scripts and miscellaneous utility programs that exercise the complete breadth and depth of the ZFS filesystem.&lt;br /&gt;
&lt;br /&gt;
The tests are best run in a virtual machine with a baseline configuration captured in a snapshot. Because the tests are destructive, the VM should be reverted to the snapshot after each run in preparation for future test runs. The tests take 2-4 hours to run depending on hardware setup.&lt;br /&gt;
&lt;br /&gt;
=== Setup ===&lt;br /&gt;
&lt;br /&gt;
The user zfs-tests needs to be able to run sudo without entering a password. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   zfs-tests ALL=(ALL) NOPASSWD: ALL&lt;br /&gt;
&lt;br /&gt;
The sudo root environment must be configured to pass certain environment variables from zfs-tests through to the root environment. Add the following to sudoers:&lt;br /&gt;
&lt;br /&gt;
   Defaults env_keep += &amp;quot;__ZFS_MAIN_MOUNTPOINT_DIR&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Modify /etc/bashrc to contain&lt;br /&gt;
&lt;br /&gt;
   export __ZFS_MAIN_MOUNTPOINT_DIR=&amp;quot;/&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If your development directory is ~you/Developer, clone zfs, spl and zfs-test into that directory:&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs-test.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/zfs.git&lt;br /&gt;
   # git clone git@github.com:openzfsonosx/spl.git&lt;br /&gt;
&lt;br /&gt;
Build ZFS using the building from source instructions.&lt;br /&gt;
&lt;br /&gt;
Ensure that /var/tmp has approximately 100GB of free space.&lt;br /&gt;
&lt;br /&gt;
Create three virtual hard drives of 10-20GB capacity each.&lt;br /&gt;
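The backing files for those drives can be sketched as below (a minimal sketch with made-up paths, and sizes scaled down to 10MB purely for illustration; for real runs use 10-20GB files and attach them as raw disk images with hdiutil, as in the file-based zpools section):&lt;br /&gt;

```shell
# Create three file-backed disks for the test run (sizes illustrative).
for i in 1 2 3; do
    dd if=/dev/zero of=/var/tmp/zfstest-disk$i bs=1048576 count=10 2>/dev/null
done
ls -l /var/tmp/zfstest-disk1 /var/tmp/zfstest-disk2 /var/tmp/zfstest-disk3
```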
&lt;br /&gt;
=== Run Test Suite ===&lt;br /&gt;
&lt;br /&gt;
Setup the tests to run&lt;br /&gt;
&lt;br /&gt;
   # cd ~you/Developer/zfs-test&lt;br /&gt;
   # ./autogen.sh&lt;br /&gt;
   # ./configure CC=clang CXX=clang++&lt;br /&gt;
&lt;br /&gt;
Edit the generated Makefile, change the recipe for the test_hw target such that your three virtual disks are listed in the DISKS environment variable.&lt;br /&gt;
&lt;br /&gt;
   test_hw: test_verify test/zfs-tests/cmd&lt;br /&gt;
           @KEEP=&amp;quot;`zpool list -H -oname`&amp;quot; \&lt;br /&gt;
            STF_TOOLS=$(abs_top_srcdir)/test/test-runner/stf \&lt;br /&gt;
            STF_SUITE=$(abs_top_srcdir)/test/zfs-tests \&lt;br /&gt;
            DISKS=&amp;quot;/dev/disk3 /dev/disk1 /dev/disk2&amp;quot; \&lt;br /&gt;
            su zfs-tests -c &amp;quot;ksh $(abs_top_srcdir)/test/zfs-tests/cmd/scripts/zfstest.ksh $$RUNFILE&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Run the test suite&lt;br /&gt;
&lt;br /&gt;
   sudo make test&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick look at how ZFS and HFS+ compare, to get a sense of how much room there is for improvement.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, this is not an ideal test setup, but it should serve as an indicator.&lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section attempts to outline the differences between the ZFS versions on other platforms and the version on OS X, to assist developers new to the Apple platform who wish to contribute to, or understand, development of O3X.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that is rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls &amp;quot;vnops&amp;quot;), etc., is '''in the call to vnode_create()'''.&lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes to avoid reclaim cannot be used.&lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create, so we cannot do anything with the dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it will do so directly, on the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls we can&lt;br /&gt;
use the API call vnode_isrecycled() to detect whether the vnop was called &amp;quot;the normal way&amp;quot; or from vclean. If we come from vclean and the vnode is doomed, we do as little as possible: we cannot open a new TX, and&lt;br /&gt;
we cannot take mutex locks (panic: locking against ourselves).&lt;br /&gt;
&lt;br /&gt;
Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you hit the lovely XNU code in vfs_subr.c:&lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle that we might be re-entrant as the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are FALSE, we can call direct reclaim right there, since in those cases no final dmu_tx is required. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). As a separate thread, it can safely open a&lt;br /&gt;
dmu_tx.&lt;br /&gt;
&lt;br /&gt;
If fastpath is TRUE, we do nothing more in zfs_vnop_reclaim. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that Illumos has a delete fastpath. In zfs_remove, if it is detected that the znode can be deleted now (&amp;quot;delete_now&amp;quot;), it marks the vnode as free and directly calls zfs_znode_delete(); if it cannot, it adds the znode via zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, but it remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove we attempt to call vnode_recycle(), and only if this returns 1 do we know that vnop_reclaim was called, and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code (zp-&amp;gt;z_fastpath) to do nothing with the vnode except clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a little special lock handling in zfs_zinactive, since it can be called from inside a vnode_create(), which ZFS calls with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to trigger a mount from inside the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we cannot call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter its probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply in the affirmative. However, automount is disabled here, as there is no way to specify a mountpoint with auto-mounting;&lt;br /&gt;
zfs.util instead calls DADiskMount to mount it at the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount successful. The first getattr triggers the mount attempt, and the DADiskMount call causes getattr to be called,&lt;br /&gt;
so we have to pretend to have said entry.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OS X's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs send/recv) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in calls to '''getf''' and '''releasef''') needs to place the internal XNU '''struct fileproc''' in the wrapper '''struct spl_fileproc''' so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e., a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name,&lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-15T01:00:34Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: /* Detecting memory handling errors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or, use another terminal&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation that used the zones allocator - we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 in the boot arguments; then we can use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
and indeed, list any index&lt;br /&gt;
&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. This caching can yield a significant improvement in filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Selection of an allocator has been very challenging on OS X. In the last year we have evolved from:&lt;br /&gt;
* Direct call to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k).&lt;br /&gt;
* Direct call to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because the result of exceeding this limit is a kernel panic with no other feedback mechanisms available.&lt;br /&gt;
* Direct call to bmalloc - a home-grown slice allocator that allocated slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocated to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of kmem and vmem allocators using code from Illumos, plus a memory pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0, the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours in the allocator - efficient lifecycle control of objects and an effective response to memory pressure in the machine - and the allocator was not very space efficient because of metadata overheads in bmalloc. We were also not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the kmem code from Illumos on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We therefore tuned vmem to take 512KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
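&lt;br /&gt;
For readers unfamiliar with the Illumos interface, the object-cache API that kmem provides to ZFS looks roughly like the following sketch (the signatures follow the Illumos kmem.h; the cache name and my_node_t type are invented for illustration):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;c&amp;quot;&amp;gt;&lt;br /&gt;
/* Create a cache of fixed-size objects; kmem carves slabs into these. */&lt;br /&gt;
kmem_cache_t *cache = kmem_cache_create(&amp;quot;my_node_cache&amp;quot;,&lt;br /&gt;
    sizeof (my_node_t),     /* object size */&lt;br /&gt;
    0,                      /* default alignment */&lt;br /&gt;
    NULL, NULL, NULL,       /* constructor, destructor, reclaim callback */&lt;br /&gt;
    NULL, NULL, 0);&lt;br /&gt;
&lt;br /&gt;
my_node_t *node = kmem_cache_alloc(cache, KM_SLEEP);&lt;br /&gt;
/* ... use node ... */&lt;br /&gt;
kmem_cache_free(cache, node);&lt;br /&gt;
kmem_cache_destroy(cache);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;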
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper - kmem and vmem implement this design. https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum, spl-kmem.c and spl-osx.c need to see this define for the debugging features to be completely enabled.&lt;br /&gt;
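&lt;br /&gt;
One way to arrange this (a sketch only - it assumes the autoconf build passes user CFLAGS through to the module compile, as the configure examples elsewhere on this page suggest) is to define DEBUG at configure time:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./configure CFLAGS=&amp;quot;-DDEBUG&amp;quot;&lt;br /&gt;
$ make&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;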
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect to not panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c set kmem_panic=0 to log, kmem_panic=1 to log+panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
   { &lt;br /&gt;
       ...&lt;br /&gt;
       int *p;&lt;br /&gt;
       for(int i=0; i&amp;lt;20;i++) {&lt;br /&gt;
          p = (int*)spl_kmem_alloc(1024);&lt;br /&gt;
          spl_kmem_free(p);&lt;br /&gt;
          *p = 0;  /* deliberate modify-after-free */&lt;br /&gt;
       }&lt;br /&gt;
   }&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see the kind of memory corruption, the actual corrupted data, which kmem cache was involved, the relative time at which the last action occurred, and the stack trace for the last action (which was a call to zfs_kmem_free()) - indicating that spl_start() was implicated in the fault. This event would have been logged on the next allocation after the free-and-modify occurred.&lt;br /&gt;
&lt;br /&gt;
=== Compiling for older OS X versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version (in this case, targeting 10.9 while building on 10.10):&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
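&lt;br /&gt;
To confirm that the resulting binaries really target the older release, you can inspect their load commands with otool (the kext path here is illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ otool -l module/zfs/zfs.kext/Contents/MacOS/zfs | grep -A2 LC_VERSION_MIN_MACOSX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;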
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html BrendanGregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick look at how they compare, to gauge how much room for improvement there is.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, these are not ideal testing conditions, but they should serve as an indicator.&lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing ==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section attempts to outline the differences between the ZFS versions on other platforms and the OS X version, to assist developers new to the Apple platform who wish to assist with, or understand, development of O3X.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that is rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls &amp;quot;vnops&amp;quot;) and so on is '''in the call to vnode_create()'''.&lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, cannot be used.&lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create, so we cannot do anything with dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it will do so directly, on the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these vnops we can&lt;br /&gt;
use the API call vnode_isrecycled() to detect whether we were called &amp;quot;the normal way&amp;quot; or from vclean. If we come from vclean, and the vnode is doomed, we do as little as possible. We cannot open a new TX, and&lt;br /&gt;
we cannot use mutex locks (panic: locking against ourselves).&lt;br /&gt;
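&lt;br /&gt;
As a simplified sketch of that guard (the handler body is illustrative, but vnode_isrecycled() is the real XNU API):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;c&amp;quot;&amp;gt;&lt;br /&gt;
static int&lt;br /&gt;
zfs_vnop_inactive(struct vnop_inactive_args *ap)&lt;br /&gt;
{&lt;br /&gt;
    struct vnode *vp = ap-&amp;gt;a_vp;&lt;br /&gt;
&lt;br /&gt;
    if (vnode_isrecycled(vp)) {&lt;br /&gt;
        /* Called via vclean(): the vnode is doomed. Do as little as&lt;br /&gt;
         * possible - no new tx, no mutex locks. */&lt;br /&gt;
        return (0);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    /* Normal path: safe to take locks and open transactions. */&lt;br /&gt;
    return (0);&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;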
&lt;br /&gt;
Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you find the lovely XNU code of &lt;br /&gt;
 if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
         panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
in vfs_subr.c.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle that we might be re-entrant as the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are FALSE, we can perform reclaim directly right there, since in those cases no final dmu_tx is triggered. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). As a separate thread, it can safely open a&lt;br /&gt;
dmu_tx.&lt;br /&gt;
&lt;br /&gt;
If fastpath is TRUE, we do nothing more in zfs_vnop_reclaim. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that Illumos has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it cannot, it adds the znode via zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; that is, XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, and remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code to not do anything with the vnode (zp-&amp;gt;z_fastpath) but to only clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a little special lock handling in zfs_zinactive, since it can be called from inside a vnode_create(), which ZFS calls with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to cause a mount from within the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we cannot call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter its probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply in the affirmative. However, automounting is disabled here, as there is no way to specify a mountpoint with auto;&lt;br /&gt;
zfs.util will instead call DADiskMount to mount it at the correct directory.&lt;br /&gt;
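&lt;br /&gt;
The zfs.util side of that mount request might look roughly like this (simplified; the variable names and callback are illustrative, but DADiskMount is the real DiskArbitration call):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;c&amp;quot;&amp;gt;&lt;br /&gt;
/* Ask diskarbitrationd to mount the snapshot's virtual disk at a&lt;br /&gt;
 * specific directory, since automounting cannot choose the path. */&lt;br /&gt;
CFURLRef where = CFURLCreateFromFileSystemRepresentation(NULL,&lt;br /&gt;
    (const UInt8 *)mountpoint, strlen(mountpoint), true);&lt;br /&gt;
DADiskMount(disk, where, kDADiskMountOptionDefault,&lt;br /&gt;
    mount_done_callback, NULL);&lt;br /&gt;
CFRelease(where);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;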
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information for the mount to succeed. The first getattr triggers the mount attempt; the DADiskMount call will in turn cause getattr to be called,&lt;br /&gt;
and we have to pretend to already have said entry.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OS X's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs send/recv) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in the calls to '''getf''' and '''releasef''') needs to place the internal XNU ''struct fileproc'' in the wrapper ''struct spl_fileproc'', so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e. a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name,&lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-15T00:21:03Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: /* Detecting memory handling errors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or, use another terminal&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another symbol file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation that used the zones allocator - we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is the suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 in the boot arguments; then we can use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
and indeed, we can list any index:&lt;br /&gt;
&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory; this caching can yield significant filesystem performance improvements.&lt;br /&gt;
&lt;br /&gt;
Selecting an allocator has been very challenging on OS X. Over the last year we have evolved through:&lt;br /&gt;
* Direct calls to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k).&lt;br /&gt;
* Direct calls to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because exceeding this limit causes a kernel panic with no other feedback mechanism available.&lt;br /&gt;
* Direct calls to bmalloc - a home-grown slice allocator that took slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space-inefficient; it was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocated to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of the kmem and vmem allocators using code from Illumos, plus a memory pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS, but in O3X releases up to 1.3.0 the kmem implementation was a stub that passed allocation requests to an underlying allocator. As of O3X 1.3.0 we were still missing some key allocator behaviours - efficient lifecycle control of objects and an effective response to memory pressure in the machine - and the allocator was not very space-efficient because of metadata overheads in bmalloc. We were also not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the Illumos kmem code on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We tuned vmem to take 512KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper - kmem and vmem implement this design: https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum spl-kmem.c and spl-osx.c need to see this define for the debugging features to be completely enabled.&lt;br /&gt;
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect not to panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c, set kmem_panic=0 to log only, or kmem_panic=1 to log and panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
    { &lt;br /&gt;
        ...&lt;br /&gt;
        int *p;&lt;br /&gt;
        for (int i = 0; i &amp;lt; 20; i++) {&lt;br /&gt;
            p = (int *)spl_kmem_alloc(1024);&lt;br /&gt;
            spl_kmem_free(p);&lt;br /&gt;
            *p = 0;   /* modify after free: the deliberate bug */&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see the kind of memory corruption, the actual corrupted data, which kmem cache was involved, the relative time at which the last action occurred, and the stack trace for the last action (a call to zfs_kmem_free()) - indicating that spl_start() was implicated in the fault. This event would have been logged on the next allocation after the free-and-modify occurred.&lt;br /&gt;
&lt;br /&gt;
=== Compiling for older OS X versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version - in this case, compiling for 10.9 on a 10.10 system:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html BrendanGregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick peek at how they compare, just to see how much we should improve by.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, this is not an ideal test setup, but it should serve as an indicator. &lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section attempts to outline the differences between the ZFS versions of other platforms and OS X, to assist developers new to the Apple platform who wish to assist with, or understand, development of the O3X version.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that is rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls &amp;quot;vnops&amp;quot;), etc., is '''in the call to vnode_create()'''. &lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, cannot be used. &lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create, so we cannot do anything with dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it will do so directly, on the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls, we can use the API call vnode_isrecycled() to detect whether the vnop was called &amp;quot;the normal way&amp;quot; or from vclean. If we come from vclean and the vnode is doomed, we do as little as possible: we cannot open a new TX, and we cannot use mutex locks (panic: locking against ourselves).&lt;br /&gt;
&lt;br /&gt;
Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you find the lovely XNU code of &lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
in vfs_subr.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle that we might be re-entrant as the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are FALSE, we can call direct reclaim right there, as in those cases no final dmu_tx is caused. Following the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). As a separate thread it can handle calling&lt;br /&gt;
dmu_tx. &lt;br /&gt;
&lt;br /&gt;
If fastpath is TRUE, we do nothing more in zfs_vnop_reclaim. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that Illumos has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it cannot, it adds the znode to the unlinked list via zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; i.e., XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, but remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called, and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code (zp-&amp;gt;z_fastpath) to not do anything with the vnode but only clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a little special lock-handling in zfs_zinactive, since we can call it from inside of a vnode_create() which is called by ZFS with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to cause a mount from within the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we cannot call. So instead, we have to jump through a whole bunch of hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter the probe phase, which includes calling all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply in the affirmative. However, automount is disabled here, as there is no way to specify a mountpoint with auto; zfs.util instead calls DADiskMount to mount the snapshot at the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount successful. The first getattr triggers the mount attempt; the DADiskMount call then causes getattr to be called, and we have to pretend to already have said entry.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OS X's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs recv/send) and zfs_ioc_diff (zfs diff). The XNU implementation of zfs_onexit (as in calls to '''getf''' and '''releasef''') needs to place the internal XNU '''struct fileproc''' in the wrapper '''struct spl_fileproc''', so that '''spl_vn_rdwr()''' can use it to do IO. This is the only way to do IO on a non-file-based vnode (i.e. a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name, &lt;br /&gt;
including for hard link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-15T00:17:22Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: /* Detecting memory handling errors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
Press &amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or use another terminal.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But for investigating a hang without a panic, this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation, which used the zones allocator - we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 in the boot arguments; then we can use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
And indeed, we can list any index:&lt;br /&gt;
&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be significant filesystem performance improvement.&lt;br /&gt;
&lt;br /&gt;
Selection of an allocator has been very challenging on OS X. In the last year we have evolved from:&lt;br /&gt;
* Direct call to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k).&lt;br /&gt;
* Direct call to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because the result of exceeding this limit is a kernel panic with no other feedback mechanisms available.&lt;br /&gt;
* Direct call to bmalloc - a home-grown slice allocator that allocated slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocation to O3X was kept to 50% of the machine's memory.&lt;br /&gt;
* Implementation of kmem and vmem allocators using code from Illumos, with a memory pressure monitor mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
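&lt;br /&gt;
The slice-allocator idea described above (grab large chunks from the page allocator and subdivide them into smaller units) can be sketched in a few lines of userspace C. This is an illustrative model only; the arena/slice names are invented for this sketch, and it is not the actual bmalloc or vmem code:&lt;br /&gt;

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

/* Model: take one large chunk from the underlying (page) allocator and
 * subdivide it into fixed-size units threaded onto a free list. */
#define CHUNK_SIZE  (512 * 1024)   /* vmem was tuned to 512 KB chunks */
#define SLICE_SIZE  256

struct slice { struct slice *next; };

struct slice_arena {
    void         *chunk;      /* the one big allocation */
    struct slice *freelist;   /* carved-up units available to hand out */
};

static void arena_init(struct slice_arena *a) {
    a->chunk = malloc(CHUNK_SIZE);       /* stands in for the page allocator */
    a->freelist = NULL;
    char *base = (char *)a->chunk;
    for (size_t off = 0; off + SLICE_SIZE <= CHUNK_SIZE; off += SLICE_SIZE) {
        struct slice *s = (struct slice *)(base + off);
        s->next = a->freelist;           /* push each carved slice */
        a->freelist = s;
    }
}

static void *arena_alloc(struct slice_arena *a) {
    struct slice *s = a->freelist;       /* pop from the free list */
    if (s != NULL)
        a->freelist = s->next;
    return s;
}

static void arena_free(struct slice_arena *a, void *ptr) {
    struct slice *s = (struct slice *)ptr;
    s->next = a->freelist;               /* push back onto the free list */
    a->freelist = s;
}
```

The space inefficiency mentioned above comes from rounding every request up to a slice size; the real allocators keep per-size caches to reduce that waste.&lt;br /&gt;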
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0 the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours in the allocator: efficient lifecycle control of objects and an effective response to memory pressure in the machine. The allocator was also not very space efficient because of metadata overheads in bmalloc, and we were not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the Illumos kmem code on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We therefore tuned vmem to take 512 KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory back to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.  &lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper (kmem and vmem implement this design): https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
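&lt;br /&gt;
The modify-after-free check, for instance, can be modelled with the same 0xdeadbeef pattern that appears in the example log further down: on free the allocator fills the buffer with a known pattern, and before reuse it verifies the pattern is intact. A minimal userspace sketch, with invented function names (not the real SPL internals):&lt;br /&gt;

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Pattern written over freed buffers so later modification is detectable. */
#define FREE_PATTERN 0xdeadbeefdeadbeefULL

/* Called at free time: poison the buffer with the marker pattern. */
static void debug_poison(uint64_t *buf, size_t words) {
    for (size_t i = 0; i < words; i++)
        buf[i] = FREE_PATTERN;
}

/* Called when the buffer is next handed out: return the byte offset of the
 * first corrupted word, or -1 if the pattern is intact. */
static long debug_verify(const uint64_t *buf, size_t words) {
    for (size_t i = 0; i < words; i++)
        if (buf[i] != FREE_PATTERN)
            return (long)(i * sizeof(uint64_t));
    return -1;
}
```

A real allocator additionally records which value replaced the pattern, at which offset, and the previous transaction on the buffer, which is exactly what the log output below shows.&lt;br /&gt;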
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum spl-kmem.c and spl-osx.c need to see this define for the debugging features to be completely enabled.&lt;br /&gt;
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect not to panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c, set kmem_panic=0 to log, or kmem_panic=1 to log+panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
   { &lt;br /&gt;
       ...&lt;br /&gt;
       int *p;&lt;br /&gt;
       for (int i = 0; i &amp;lt; 20; i++) {&lt;br /&gt;
          p = (int *)spl_kmem_alloc(1024);&lt;br /&gt;
          spl_kmem_free(p);&lt;br /&gt;
          *p = 0;   /* deliberate modify-after-free */&lt;br /&gt;
       }&lt;br /&gt;
   }&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
   14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see that spl_start() was present in the trace.&lt;br /&gt;
&lt;br /&gt;
=== Compiling to lower OSX versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OSX version - in this case, compiling for 10.9 on a 10.10 system:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html BrendanGregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick peek at how they compare, just to see how much we should improve things by.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, this is not an ideal test setup, but it should serve as an indicator. &lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section is an attempt to outline the differences between the ZFS versions of other platforms and OS X, to assist developers new to the Apple platform who wish to contribute to, or understand, development of the O3X version.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that is rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls &amp;quot;vnops&amp;quot;) etc. is '''in the call to vnode_create()'''. &lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, can not be used. &lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create; so we can not do anything with dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it will do so directly, in the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls, we can &lt;br /&gt;
use the API call vnode_isrecycled() to detect whether the vnop was called &amp;quot;the normal way&amp;quot; or from vclean. If we come from vclean and the vnode is doomed, we do as little as possible: we can not open a new TX, and&lt;br /&gt;
we can not use mutex locks (panic: locking against ourselves).&lt;br /&gt;
&lt;br /&gt;
Nor is there any way to defer or delay a doomed vnode. If vnop_reclaim returns anything but 0, you hit the lovely XNU code of &lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
in vfs_subr.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle that we might be re-entrant as the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are FALSE, we can perform the reclaim directly right there, since in those cases no final dmu_tx is caused. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). As a separate thread, it can safely open a&lt;br /&gt;
dmu_tx. &lt;br /&gt;
&lt;br /&gt;
If fastpath is TRUE, we do nothing more in zfs_vnop_reclaim. See below.&lt;br /&gt;
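&lt;br /&gt;
The reclaim-path selection described above can be modelled as a pure function. This is an illustrative userspace sketch; choose_reclaim_path and the enum names are invented here (the real logic lives inline in zfs_vnop_reclaim and works on the znode fields directly):&lt;br /&gt;

```c
#include <assert.h>
#include <stdbool.h>

/* Possible handling paths for a reclaimed znode, per the text above. */
enum reclaim_path {
    RECLAIM_DIRECT,     /* no final dmu_tx needed: reclaim in this thread */
    RECLAIM_DEFERRED,   /* exception: queue the zp for the reclaim_thread */
    RECLAIM_NOOP        /* fastpath: zfs_remove already did the work */
};

/* exception = (z_sa_hdl != NULL && z_unlinked); fastpath = z_fastpath */
static enum reclaim_path choose_reclaim_path(bool exception, bool fastpath) {
    if (fastpath)
        return RECLAIM_NOOP;        /* only clear z_vnode and return */
    if (exception)
        return RECLAIM_DEFERRED;    /* reclaim_thread will call zfs_rmnode() */
    return RECLAIM_DIRECT;          /* safe to reclaim in the calling thread */
}
```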
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that Illumos has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it can not, it adds the znode via zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, and remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called, so that we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code (zp-&amp;gt;z_fastpath) to not do anything with the vnode but only clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also some special lock-handling in zfs_zinactive, since it can be called from inside a vnode_create(), which ZFS calls with locks held. If that is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to cause a mount from within the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we can not call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter its probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually zfs.util is called and we reply affirmatively. However, automounting is disabled here, as there is no way to specify a mountpoint with auto;&lt;br /&gt;
instead zfs.util calls DADiskMount to mount the snapshot at the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount succeed. The first getattr triggers the mount attempt; the DADiskMount call in turn causes getattr to be called,&lt;br /&gt;
and we have to pretend the entry already exists.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OS X's SPL. The '''spl_vn_rdwr()''' call must be used when zfs_onexit is in use, for example in dmu_send.c (zfs send/recv) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in calls to '''getf''' and '''releasef''') needs to place the internal XNU ''struct fileproc'' in the wrapper ''struct spl_fileproc'', so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e. a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name,&lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-15T00:14:40Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or, use another terminal&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation that used the zones allocator; we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 in the boot arguments, and then we can use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
And indeed, we can list any index:&lt;br /&gt;
&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be a significant improvement in filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Selection of an allocator has been very challenging on OS X. Over the last year we have evolved through:&lt;br /&gt;
* Direct call to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k)&lt;br /&gt;
* Direct call to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because exceeding this limit causes a kernel panic with no other feedback mechanism available.&lt;br /&gt;
* Direct call to bmalloc - a home-grown slice allocator that took slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space-inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure, so the total memory allocation to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of the kmem and vmem allocators using code from Illumos, plus a memory-pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS and scale back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0 the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours in the allocator - efficient lifecycle control of objects and an effective response to memory pressure - and the allocator was not very space-efficient because of metadata overheads in bmalloc. We were also not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the kmem code from Illumos on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We therefore tuned vmem to take 512 KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper - kmem and vmem implement this design: https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum, spl-kmem.c and spl-osx.c need to see this define for the debugging features to be fully enabled.&lt;br /&gt;
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect not to panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c, set kmem_panic=0 to log only, or kmem_panic=1 to log and panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
{&lt;br /&gt;
    ...&lt;br /&gt;
    int *p;&lt;br /&gt;
    for (int i = 0; i &amp;lt; 20; i++) {&lt;br /&gt;
        p = (int *)spl_kmem_alloc(1024);&lt;br /&gt;
        spl_kmem_free(p);&lt;br /&gt;
        *p = 0;  /* modify after free */&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c&lt;br /&gt;
14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17&lt;br /&gt;
&lt;br /&gt;
You can clearly see that spl_start() was present in the trace.&lt;br /&gt;
&lt;br /&gt;
=== Compiling to lower OSX versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version, in this case compiling for 10.9 on a 10.10 system:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html Brendan Gregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick look at how HFS+ and ZFS compare, to give a sense of how much we should aim to improve.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, this is not an ideal test setup, but it should serve as an indicator.&lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section is an attempt to outline the differences between the ZFS versions on other platforms and the one on OS X, to assist developers who are new to the Apple platform and wish to contribute to, or understand, development of the O3X version.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First, it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of the vnode and tailor it to each OS X version where the vnode changes, but that would be rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls, &amp;quot;vnops&amp;quot;), etc. is '''in the call to vnode_create()'''.&lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, cannot be used.&lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create, so we can not do anything with dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it will do so directly, in the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls we can&lt;br /&gt;
use the API call vnode_isrecycled() to detect whether the vnop was called &amp;quot;the normal way&amp;quot; or from vclean. If we come from vclean and the vnode is doomed, we do as little as possible: we can not open a new TX, and&lt;br /&gt;
we can not use mutex locks (panic: locking against ourselves).&lt;br /&gt;
&lt;br /&gt;
Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you find the lovely XNU code of &lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
in vfs_subr.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle the case where we are re-entered as the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are FALSE, we can reclaim directly right there, since in those cases no final dmu_tx is required. Along&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). Running as its own thread, it is free to open a&lt;br /&gt;
dmu_tx.&lt;br /&gt;
&lt;br /&gt;
If fastpath is TRUE, zfs_vnop_reclaim does nothing further. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that IllumOS has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it cannot, it calls zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, and remains active in the meantime (especially if you are racing against other threads using the same vnode). So in zfs_remove we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code (keyed off zp-&amp;gt;z_fastpath) to do nothing with the vnode except clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also some special lock handling in zfs_zinactive, since it can be called from inside a vnode_create(), which ZFS calls with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to trigger a mount from inside the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we cannot call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter the probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply affirmative. However, automount is disabled here, as there is no way to specify a mountpoint with auto;&lt;br /&gt;
zfs.util instead calls DADiskMount to mount it at the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount successful. The first getattr triggers the mount attempt; the DADiskMount call then causes getattr to be called,&lt;br /&gt;
and we have to pretend to have said entry.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OSX's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs send/recv) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in calls to '''getf''' and '''releasef''') needs to place the internal XNU '''struct fileproc''' in the wrapper '''struct spl_fileproc''', so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e., a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name,&lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-15T00:13:18Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: /* Kernel */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or, use another terminal&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic a kernel with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to the old O3X implementation that used the zones allocator; we now use our own kmem allocator.)&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 in the boot arguments, and then we can use:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest&lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And indeed, we can list any index:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be significant filesystem performance improvement.&lt;br /&gt;
&lt;br /&gt;
Selection of an allocator has been very challenging on OS X. In the last year we have evolved through several approaches:&lt;br /&gt;
* Direct calls to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4k).&lt;br /&gt;
* Direct calls to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because the result of exceeding this limit is a kernel panic with no other feedback mechanism available.&lt;br /&gt;
* Direct calls to bmalloc - a home-grown slice allocator that allocated slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space-inefficient. Used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocated to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of the kmem and vmem allocators using code from Illumos, plus a memory-pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0 the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours in the allocator - efficient lifecycle control of objects and an effective response to memory pressure in the machine - and the allocator was not very space-efficient because of metadata overheads in bmalloc. We were also not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the kmem code from Illumos on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We tuned vmem to take 512 KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper on the slab allocator - kmem and vmem implement this design: https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Detecting memory handling errors ===&lt;br /&gt;
&lt;br /&gt;
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.&lt;br /&gt;
&lt;br /&gt;
The memory handling errors that can be detected include:&lt;br /&gt;
* Modify after free&lt;br /&gt;
* Write past end of buffer&lt;br /&gt;
* Free of memory not managed by kmem&lt;br /&gt;
* Double free of memory&lt;br /&gt;
* Various other corruptions&lt;br /&gt;
* Freed size != allocated size&lt;br /&gt;
* Freed address != allocated address&lt;br /&gt;
&lt;br /&gt;
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum spl-kmem.c and spl-osx.c need to see this define for the debugging features to be completely enabled.&lt;br /&gt;
&lt;br /&gt;
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the &amp;quot;systemlog&amp;quot; command to view the diagnostic message. If you elect to not panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c set kmem_panic=0 to log, kmem_panic=1 to log+panic.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
I modified spl_start() to include the following:&lt;br /&gt;
&lt;br /&gt;
{&lt;br /&gt;
    ...&lt;br /&gt;
    int *p;&lt;br /&gt;
    for (int i = 0; i &amp;lt; 20; i++) {&lt;br /&gt;
        p = (int *)spl_kmem_alloc(1024);&lt;br /&gt;
        spl_kmem_free(p);&lt;br /&gt;
        *p = 0;    /* deliberate modify-after-free */&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
With the debug mode enabled the following was logged:&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can clearly see that spl_start() was present in the trace.&lt;br /&gt;
&lt;br /&gt;
=== Compiling for older OS X versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version - in this case, compiling for 10.9 on a 10.10 system:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html BrendanGregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Quick peek at how HFS+ and ZFS compare, just to see how much we should improve by.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, these are not ideal testing conditions, but they should serve as an indicator.&lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section is an attempt to outline the differences between the ZFS versions of other platforms and OS X, to assist developers new to the Apple platform who wish to contribute to, or understand, development of the O3X version.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First, it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that would be rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer holding the ZFS znode), '''vfsops''' (list of filesystem calls, &amp;quot;vnops&amp;quot;) and so on is '''in the call to vnode_create()'''.&lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, cannot be used.&lt;br /&gt;
ZFS will start a new dmu_tx and then call zfs_mknode, which eventually calls vnode_create, so we cannot touch the dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it does so directly, on the same thread. It ends up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls we can&lt;br /&gt;
use the API call vnode_isrecycled() to detect whether these vnops were called &amp;quot;the normal way&amp;quot; or from vclean. If we come from vclean, and the vnode is doomed, we do as little as possible. We cannot open a new TX, and&lt;br /&gt;
we cannot take mutex locks (panic: locking against ourselves).&lt;br /&gt;
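&lt;br /&gt;
A minimal sketch of that guard in one of those vnops; vnode_isrecycled() is the real XNU KPI, the rest is illustrative:&lt;br /&gt;
&lt;br /&gt;
    /* Sketch: called from vclean() on a doomed vnode? Do the bare minimum. */&lt;br /&gt;
    if (vnode_isrecycled(vp)) {&lt;br /&gt;
        /* no new dmu_tx, no mutex locks - we may already hold them */&lt;br /&gt;
        return (0);&lt;br /&gt;
    }&lt;br /&gt;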
&lt;br /&gt;
Nor is there any way to defer, or delay, the reclaim of a doomed vnode. If vnop_reclaim returns anything but 0, you hit the lovely XNU code of&lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
in vfs_subr.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle the case where we are re-entered on the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are B_FALSE, we can reclaim directly right there, since in those cases no final dmu_tx is needed. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to B_TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is B_TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). Running as a separate thread, it can safely open a&lt;br /&gt;
dmu_tx.&lt;br /&gt;
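&lt;br /&gt;
The deferral can be sketched like this ('''reclaim_list_insert''' is a hypothetical helper name, not the actual O3X symbol):&lt;br /&gt;
&lt;br /&gt;
    /* Sketch: hand the znode to the reclaim thread, which calls zfs_rmnode(zp) */&lt;br /&gt;
    if (exception) {&lt;br /&gt;
        reclaim_list_insert(zp);    /* hypothetical; wakes reclaim_thread */&lt;br /&gt;
        return (0);&lt;br /&gt;
    }&lt;br /&gt;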
&lt;br /&gt;
If fastpath is B_TRUE, we do nothing further in zfs_vnop_reclaim. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that IllumOS has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it cannot, it calls zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; i.e., XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, and remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called, and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code (zp-&amp;gt;z_fastpath) to not do anything with the vnode, but only clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a little special lock-handling in zfs_zinactive, since it can be called from inside a vnode_create(), which ZFS calls with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to cause a mount in the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we can not call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter its probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply in the affirmative. However, automount is disabled here, as there is no way to specify a mountpoint when automounting.&lt;br /&gt;
zfs.util then calls DADiskMount to mount it at the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount successful. The first getattr triggers the mount attempt; the DADiskMount call then causes getattr to be called,&lt;br /&gt;
and we have to pretend to have said entry.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OSX's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs send/recv) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in the calls to '''getf''' and '''releasef''') needs to place the internal XNU ''struct fileproc'' in the wrapper ''struct spl_fileproc'', so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e., a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name,&lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	<entry>
		<id>https://openzfsonosx.org/wiki/Development</id>
		<title>Development</title>
		<link rel="alternate" type="text/html" href="https://openzfsonosx.org/wiki/Development"/>
				<updated>2015-08-14T23:49:26Z</updated>
		
		<summary type="html">&lt;p&gt;101.175.67.14: /* Memory leaks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:O3X development]]&lt;br /&gt;
You should also familiarize yourself with the [[Project_roadmap|project roadmap]] so that you can put the technical details here in context.&lt;br /&gt;
&lt;br /&gt;
== Kernel ==&lt;br /&gt;
&lt;br /&gt;
=== Debugging with GDB ===&lt;br /&gt;
&lt;br /&gt;
Dealing with [[Panic|panics]].&lt;br /&gt;
&lt;br /&gt;
Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html&lt;br /&gt;
&lt;br /&gt;
Boot target VM with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo nvram boot-args=&amp;quot;-v keepsyms=y debug=0x144&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make it panic.&lt;br /&gt;
&lt;br /&gt;
On your development machine, you will need the Kernel Debug Kit. Download it from Apple [https://developer.apple.com/downloads/index.action?q=Kernel%20Debug%20Kit here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ gdb /Volumes/KernelDebugKit/mach_kernel&lt;br /&gt;
(gdb) source /Volumes/KernelDebugKit/kgmacros&lt;br /&gt;
(gdb) target remote-kdp&lt;br /&gt;
(gdb) kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM&lt;br /&gt;
(gdb) showallkmods&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the addresses for ZFS and SPL modules.&lt;br /&gt;
&lt;br /&gt;
Press &amp;lt;code&amp;gt;^Z&amp;lt;/code&amp;gt; to suspend gdb, or use another terminal.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
^Z&lt;br /&gt;
$ sudo kextutil -s /tmp -n \&lt;br /&gt;
-k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \&lt;br /&gt;
../spl/module/spl/spl.kext/&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then resume gdb, or go back to gdb terminal.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
$ fg&lt;br /&gt;
(gdb) set kext-symbol-file-path /tmp&lt;br /&gt;
(gdb) add-kext /tmp/spl.kext &lt;br /&gt;
(gdb) add-kext /tmp/zfs.kext&lt;br /&gt;
(gdb) bt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debugging with LLDB ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ echo &amp;quot;settings set target.load-script-from-symbol-file true&amp;quot; &amp;gt;&amp;gt; ~/.lldbinit&lt;br /&gt;
$ lldb /Volumes/KernelDebugKit/mach_kernel     # From Yosemite, &amp;quot;/Library/Developer/KDKs/KDK_10.10_14B25.kdk/System/Library/Kernels/kernel&amp;quot; &lt;br /&gt;
(lldb) kdp-remote  192.168.30.146&lt;br /&gt;
(lldb) showallkmods&lt;br /&gt;
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)&lt;br /&gt;
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then follow the guide for GDB above.&lt;br /&gt;
&lt;br /&gt;
=== Non-panic ===&lt;br /&gt;
&lt;br /&gt;
If you prefer to work in GDB, you can always panic the kernel on demand with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -w -n &amp;quot;BEGIN{ panic();}&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
But this was revealing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log &lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt&lt;br /&gt;
$ less /tmp/trace.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that my hang is here:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PID: 156&lt;br /&gt;
    Process: zpool&lt;br /&gt;
    Thread ID: 0x4e2&lt;br /&gt;
    Thread state: 0x9 == TH_WAIT |TH_UNINT &lt;br /&gt;
    Thread wait_event: 0xffffff8006608a6c&lt;br /&gt;
    Kernel stack: &lt;br /&gt;
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)&lt;br /&gt;
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)&lt;br /&gt;
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)&lt;br /&gt;
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)&lt;br /&gt;
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)&lt;br /&gt;
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)&lt;br /&gt;
                0xffffff7f80e52a76 (0xffffff7f80e52a76)&lt;br /&gt;
                  0xffffff7f80e53fae (0xffffff7f80e53fae)&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)&lt;br /&gt;
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)&lt;br /&gt;
                              0xffffff7f80f042ee (0xffffff7f80f042ee)&lt;br /&gt;
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)&lt;br /&gt;
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)&lt;br /&gt;
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)&lt;br /&gt;
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo kextstat #grab the addresses of SPL and ZFS again&lt;br /&gt;
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \&lt;br /&gt;
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ &lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym&lt;br /&gt;
              0xffffff800056a2e4 (0xffffff800056a2e4)&lt;br /&gt;
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)&lt;br /&gt;
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)&lt;br /&gt;
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)&lt;br /&gt;
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)&lt;br /&gt;
&lt;br /&gt;
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym&lt;br /&gt;
                    0xffffff7f80e54173 (0xffffff7f80e54173)&lt;br /&gt;
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)&lt;br /&gt;
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)&lt;br /&gt;
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)&lt;br /&gt;
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)&lt;br /&gt;
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Voilà!&lt;br /&gt;
&lt;br /&gt;
=== Memory leaks ===&lt;br /&gt;
&lt;br /&gt;
(Note that this section is only relevant to old O3X implementation that used the zones allocator - we now use our own kmem allocator).&lt;br /&gt;
&lt;br /&gt;
In some cases, you may suspect memory issues, for instance if you saw the following panic:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
panic(cpu 1 caller 0xffffff80002438d8): &amp;quot;zalloc: \&amp;quot;kalloc.1024\&amp;quot; (100535 elements) retry fail 3, kfree_nop_count: 0&amp;quot;@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To debug this, you can attach GDB and use the zprint command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) zprint&lt;br /&gt;
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME&lt;br /&gt;
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX&lt;br /&gt;
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX&lt;br /&gt;
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX&lt;br /&gt;
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX&lt;br /&gt;
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX&lt;br /&gt;
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX&lt;br /&gt;
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX&lt;br /&gt;
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this case, kalloc.256 is suspect.&lt;br /&gt;
&lt;br /&gt;
Reboot the kernel with zlog=kalloc.256 in the boot arguments; then we can use&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) findoldest                                                                &lt;br /&gt;
oldest record is at log index 393:&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
and indeed, list any index&lt;br /&gt;
&lt;br /&gt;
(gdb) zstack 394&lt;br /&gt;
&lt;br /&gt;
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------&lt;br /&gt;
0xffffff800024352e &amp;lt;zalloc_canblock+78&amp;gt;:        mov    %eax,-0xcc(%rbp)&lt;br /&gt;
0xffffff80002245bd &amp;lt;get_zone_search+23&amp;gt;:        jmpq   0xffffff80002246d8 &amp;lt;KALLOC_ZINFO_SALLOC+35&amp;gt;&lt;br /&gt;
0xffffff8000224c39 &amp;lt;OSMalloc+89&amp;gt;:       mov    %rax,-0x18(%rbp)&lt;br /&gt;
0xffffff7f80e847df &amp;lt;zfs_kmem_alloc+15&amp;gt;: mov    %rax,%r15&lt;br /&gt;
0xffffff7f80e90649 &amp;lt;arc_buf_alloc+41&amp;gt;:  mov    %rax,-0x28(%rbp)&lt;br /&gt;
How many times was zfs_kmem_alloc involved in the leaked allocs?&lt;br /&gt;
&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e847df&lt;br /&gt;
occurred 3999 times in log (100% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At least we know it is our fault.&lt;br /&gt;
&lt;br /&gt;
How many times is it arc_buf_alloc?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
(gdb) countpcs 0xffffff7f80e90649&lt;br /&gt;
occurred 2390 times in log (59% of records)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory Architecture ===&lt;br /&gt;
&lt;br /&gt;
ZFS is designed to aggressively cache filesystem data in main memory. The result of this caching can be a significant filesystem performance improvement.&lt;br /&gt;
&lt;br /&gt;
Selection of an allocator has been very challenging on OS X. In the last year we have evolved through:&lt;br /&gt;
* Direct calls to OSMalloc - a very low-level allocator in the kernel - rejected because of slow performance and because the minimum allocation size is one page (4 KB)&lt;br /&gt;
* Direct calls to zalloc - the OS X zones allocator - rejected because only 25% of the machine's memory can be accessed (50% under some circumstances), and because exceeding this limit results in a kernel panic with no other feedback mechanism available.&lt;br /&gt;
* Direct calls to bmalloc - a home-grown slice allocator that allocated slices of memory from the kernel page allocator and subdivided them into smaller units of allocation for use by ZFS. This was quite successful but very space-inefficient. It was used in O3X 1.2.7 and 1.3.0. At this stage we had no real response to memory pressure in the machine, so the total memory allocated to O3X was kept to 50% of the machine.&lt;br /&gt;
* Implementation of the kmem and vmem allocators using code from Illumos, with a memory-pressure monitoring mechanism - we are now able to allocate most of the machine's memory to ZFS, and scale that back when the machine experiences memory pressure.&lt;br /&gt;
&lt;br /&gt;
O3X has the Solaris Porting Layer (SPL). The SPL has long provided the Illumos kmem.h API for use by ZFS. In O3X releases up to 1.3.0, the kmem implementation was a stub that passed allocation requests to an underlying allocator. In O3X 1.3.0 we were still missing some key behaviours in the allocator - efficient lifecycle control of objects and an effective response to memory pressure in the machine - and the allocator was not very space-efficient because of metadata overheads in bmalloc. We were also not convinced that bmalloc represented the state of the art.&lt;br /&gt;
&lt;br /&gt;
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the kmem code from Illumos on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.&lt;br /&gt;
&lt;br /&gt;
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We tuned vmem to take 512 KB chunks of memory from the page allocator rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces vmem's ability to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.&lt;br /&gt;
&lt;br /&gt;
References:&lt;br /&gt;
&lt;br /&gt;
Jeff Bonwick's paper - kmem and vmem implement this design: https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/&lt;br /&gt;
&lt;br /&gt;
=== Compiling to lower OSX versions ===&lt;br /&gt;
&lt;br /&gt;
If you wish to compile O3X for a specific OS X version - in this case, compiling for 10.9 on a 10.10 machine:&lt;br /&gt;
&lt;br /&gt;
SPL:&lt;br /&gt;
 ./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
ZFS:&lt;br /&gt;
 ./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Flamegraphs ==&lt;br /&gt;
&lt;br /&gt;
Huge thanks to [http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html BrendanGregg] for so much of the dtrace magic.&lt;br /&gt;
&lt;br /&gt;
dtrace the kernel while running a command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {&lt;br /&gt;
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will run for 60 seconds.&lt;br /&gt;
&lt;br /&gt;
Convert it to a flamegraph:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ ./stackcollapse.pl out.stacks &amp;gt; out.folded&lt;br /&gt;
$ ./flamegraph.pl out.folded &amp;gt; out.svg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is &amp;lt;code&amp;gt;rsync -a /usr/ /BOOM/deletea/&amp;lt;/code&amp;gt; running:&lt;br /&gt;
&lt;br /&gt;
[[File:rsyncflamegraph.svg|thumb|rsync flamegraph]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Or running '''Bonnie++''' in various stages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;gallery mode=&amp;quot;packed-hover&amp;quot;&amp;gt;&lt;br /&gt;
File:create.svg|Create files in sequential order|alt=[[File:create.svg]]&lt;br /&gt;
File:stat.svg|Stat files in sequential order|alt=Stat files in sequential order&lt;br /&gt;
File:delete.svg|Delete files in sequential order|alt=Delete files in sequential order&lt;br /&gt;
&amp;lt;/gallery&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:VX_create.svg|thumb|Create files in sequential order]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:iozone.svg|thumb|IOzone flamegraph]]&lt;br /&gt;
&lt;br /&gt;
[[File:iozoneX.svg|thumb|IOzone flamegraph (untrimmed)]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
== Iozone ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A quick peek at how they compare, just to see how much we should improve by.&lt;br /&gt;
&lt;br /&gt;
HFS+ and ZFS were created on the same virtual disk in VMware. Of course, this is not an ideal test setup, but it should serve as an indicator.&lt;br /&gt;
&lt;br /&gt;
The pool was created with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 \&lt;br /&gt;
-O atime=off \&lt;br /&gt;
-O casesensitivity=insensitive \&lt;br /&gt;
-O normalization=formD \&lt;br /&gt;
BOOM /dev/disk1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and the HFS+ file system was created with the standard OS X Disk Utility.app, with everything default (journaled, case-insensitive).&lt;br /&gt;
&lt;br /&gt;
'''Iozone''' was run with standard automode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sudo iozone -a -b outfile.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:hfs2_read.png|thumb|HFS+ read]]&lt;br /&gt;
[[File:hfs2_write.png|thumb|HFS+ write]]&lt;br /&gt;
[[File:zfs2_read.png|thumb|ZFS read]]&lt;br /&gt;
[[File:zfs2_write.png|thumb|ZFS write]]&lt;br /&gt;
&lt;br /&gt;
As a guess, writes need to double, and reads need to triple.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VFS ===&lt;br /&gt;
&lt;br /&gt;
[[VFS]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== File-based zpools for testing ==&lt;br /&gt;
&lt;br /&gt;
* create 2 files (each 100 MB) to be used as block devices:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk1&lt;br /&gt;
$ dd if=/dev/zero bs=1m count=100 of=vdisk2&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* attach files as raw disk images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1&lt;br /&gt;
/dev/disk2&lt;br /&gt;
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2&lt;br /&gt;
/dev/disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* create mirrored zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* show zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool status&lt;br /&gt;
  pool: tank&lt;br /&gt;
 state: ONLINE&lt;br /&gt;
  scan: none requested&lt;br /&gt;
config:&lt;br /&gt;
&lt;br /&gt;
	NAME        STATE     READ WRITE CKSUM&lt;br /&gt;
	tank        ONLINE       0     0     0&lt;br /&gt;
	  mirror-0  ONLINE       0     0     0&lt;br /&gt;
	    disk2   ONLINE       0     0     0&lt;br /&gt;
	    disk3   ONLINE       0     0     0&lt;br /&gt;
&lt;br /&gt;
errors: No known data errors&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* test ZFS features, find bugs, ...&lt;br /&gt;
&lt;br /&gt;
* export zpool:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ sudo zpool export tank&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* detach raw images:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ hdiutil detach disk2&lt;br /&gt;
&amp;quot;disk2&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk2&amp;quot; ejected.&lt;br /&gt;
$ hdiutil detach disk3&lt;br /&gt;
&amp;quot;disk3&amp;quot; unmounted.&lt;br /&gt;
&amp;quot;disk3&amp;quot; ejected.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Platform differences ==&lt;br /&gt;
&lt;br /&gt;
This section is an attempt to outline the differences between the ZFS versions on other platforms and the one on OS X, to assist developers new to the Apple platform who wish to contribute to, or understand, the development of O3X.&lt;br /&gt;
&lt;br /&gt;
=== Reclaim ===&lt;br /&gt;
&lt;br /&gt;
One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First, it is worth noting that &amp;quot;struct vnode&amp;quot; is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode.&lt;br /&gt;
(Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes. But that is rather hacky.)&lt;br /&gt;
&lt;br /&gt;
Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls, &amp;quot;vnops&amp;quot;) and so on is '''in the call to vnode_create()'''.&lt;br /&gt;
So there is no way to &amp;quot;allocate an empty vnode, and set its values later&amp;quot;. The FreeBSD method of pre-allocating vnodes, to avoid reclaim, can not be used.&lt;br /&gt;
ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create(), so we can not do anything with dmu_tx in those vnops.&lt;br /&gt;
&lt;br /&gt;
The problem is that if vnode_create decides to reclaim, it will do so directly, on the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls, we can&lt;br /&gt;
use the API call vnode_isrecycled() to detect whether the vnop is being called &amp;quot;the normal way&amp;quot; or from vclean(). If we come from vclean(), and the vnode is doomed, we do as little as possible: we can not open a new TX, and&lt;br /&gt;
we can not take mutex locks (panic: locking against ourselves).&lt;br /&gt;
&lt;br /&gt;
Nor is there any way to defer, or delay, reclaiming a doomed vnode. If vnop_reclaim returns anything but 0, you hit this lovely XNU code in vfs_subr.c:&lt;br /&gt;
 2205         if (VNOP_RECLAIM(vp, ctx))&lt;br /&gt;
 2206                 panic(&amp;quot;vclean: cannot reclaim&amp;quot;);&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, at the moment there is some extra logic in '''zfs_vnop_reclaim''' to handle the case where we are re-entered on the '''vnode_create''' thread.&lt;br /&gt;
&lt;br /&gt;
    exception = ((zp-&amp;gt;z_sa_hdl != NULL) &amp;amp;&amp;amp;&lt;br /&gt;
        zp-&amp;gt;z_unlinked) ? B_TRUE : B_FALSE;&lt;br /&gt;
    fastpath = zp-&amp;gt;z_fastpath;&lt;br /&gt;
&lt;br /&gt;
If both exception and fastpath are B_FALSE, we can reclaim directly right there, since in those cases no final dmu_tx is required. Following&lt;br /&gt;
the zfs_rmnode-&amp;gt;zfs_purgedir-&amp;gt;zget and similar paths, exception is set to B_TRUE.&lt;br /&gt;
&lt;br /&gt;
If exception is B_TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp). As a separate thread, it can safely open&lt;br /&gt;
a dmu_tx.&lt;br /&gt;
&lt;br /&gt;
If fastpath is B_TRUE, we do nothing further in zfs_vnop_reclaim. See below.&lt;br /&gt;
&lt;br /&gt;
=== Fastpath vs Recycle ===&lt;br /&gt;
&lt;br /&gt;
Another interesting aspect is that Illumos has a delete fastpath. In zfs_remove, if it is detected that the znode can be &amp;quot;deleted_now&amp;quot;, it marks the vnode as free and directly calls zfs_znode_delete(); if it can not, it calls zfs_unlinked_add().&lt;br /&gt;
&lt;br /&gt;
In OS X, there is no way to directly release a vnode; i.e., XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released '''until''' vnop_reclaim is called. The vnode can only be marked for later reclaim, and remains active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns &amp;quot;1&amp;quot; do we know that vnop_reclaim was called, and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code (zp-&amp;gt;z_fastpath) to not do anything with the vnode, but only clear out the z_vnode and return.&lt;br /&gt;
&lt;br /&gt;
    zp-&amp;gt;z_fastpath = B_TRUE;&lt;br /&gt;
    if (vnode_recycle(vp) == 1) {&lt;br /&gt;
        /* recycle/reclaim is done, so we can just release now */&lt;br /&gt;
        zfs_znode_delete(zp, tx);&lt;br /&gt;
    } else {&lt;br /&gt;
        /* failed to recycle, so just place it on the unlinked list */&lt;br /&gt;
        zp-&amp;gt;z_fastpath = B_FALSE;&lt;br /&gt;
        zfs_unlinked_add(zp, tx);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a little special lock-handling in zfs_zinactive, since it can be called from inside a vnode_create(), which ZFS calls with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.&lt;br /&gt;
&lt;br /&gt;
=== snapshot mounts ===&lt;br /&gt;
&lt;br /&gt;
There is no way to cause a mount in the XNU kernel. None. At. All. Apple themselves cheated and added a static nfsmount() that we can not call. So instead, we have to jump through a whole bunch of&lt;br /&gt;
hoops to get there. We create a fake/virtual /dev/diskX entry for the snapshot. '''diskarbitrationd''' will wake up due to the new disk and enter its probe phase, which includes calling&lt;br /&gt;
all the /System/Library/Filesystems/ bundles. Eventually, zfs.util is called and we reply in the affirmative. However, automount is disabled here, as there is no way to specify a mountpoint when automounting.&lt;br /&gt;
zfs.util then calls DADiskMount to mount it at the correct directory.&lt;br /&gt;
&lt;br /&gt;
This means we have a few more VNOPs in zfs_ctldir.c, as we have to reply with correct information to make the mount successful. The first getattr triggers the mount attempt; the DADiskMount call then causes getattr to be called,&lt;br /&gt;
and we have to pretend to have said entry.&lt;br /&gt;
&lt;br /&gt;
=== spl_vn_rdwr vs vn_rdwr ===&lt;br /&gt;
&lt;br /&gt;
There are two calls to vn_rdwr() in OSX's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs send/recv) and zfs_ioc_diff (zfs diff). The XNU implementation of&lt;br /&gt;
zfs_onexit (as in the calls to '''getf''' and '''releasef''') needs to place the internal XNU ''struct fileproc'' in the wrapper ''struct spl_fileproc'', so that '''spl_vn_rdwr()''' can use it to do IO.&lt;br /&gt;
This is the only way to do IO on a non-file-based vnode (i.e., a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== getattr ===&lt;br /&gt;
&lt;br /&gt;
XNU has a whole bunch of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name,&lt;br /&gt;
including for hard-link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode will return the correct name if VA_NAME is requested.&lt;/div&gt;</summary>
		<author><name>101.175.67.14</name></author>	</entry>

	</feed>