zalloc project

James Cook falsifian at falsifian.org
Thu May 27 07:51:45 PDT 2021


Here's a small progress report, together with some long-winded notes. I
don't necessarily expect anyone to read them, but advice or other
comments are always appreciated!

Summary:
- I switched SWAPMETA over to kmalloc_obj with no deadlock mitigation,
  because I want to see that fail before I start fixing it. (Diff at
  bottom.)
- I haven't observed a deadlock yet.
- I think I just need to try harder to get the deadlock.


In a bit more detail:

I tried blindly switching the SWAPMETA subsystem from zalloc to
kmalloc_obj, without changing kmalloc_obj at all (diff at bottom). I
then did a small stress-test on Tuesday, hoping to observe some kind of
deadlock.

My goal: I want to observe the problem before I go about adding code to
fix it. Trying to solve a problem I've never observed seems like a bad
idea!

Unfortunately, I have been unsuccessful so far: the system ran fine
during my stress test. After thinking it through, it's not that
surprising to me. I think I may need a better stress test.

The rest of this email is my own notes about understanding what is
going on. Feel free to ignore; I'm writing this partly to get my own
thoughts in order.


In long-winded detail:

The stress test: I set hw.physmem="2g" in /boot/loader.conf, then
opened some chrome tabs and built a new kernel. A little over 1GiB of
swap got eaten up, but the system didn't perform too badly.

Based on my thoughts below, I think I should make two changes to this
stress test:
- Reduce swap space so that it fills up.
- Allocate memory more aggressively. Maybe I could almost fill
  memory+swap, then start a tight loop of:
  - Allocate one new page
  - Free one old page
  Though I should think about how much that will wear my SSD. (A rough
  userland sketch of this loop follows the list.)
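
Something like this is what I have in mind for that loop (an untested
userland sketch; NPAGES is a made-up knob that would need tuning
against hw.physmem plus swap so that only a small cushion stays free):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES	(700UL * 1024)	/* made-up; tune to roughly memory+swap */

int
main(void)
{
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
	char **pages = calloc(NPAGES, sizeof(*pages));
	size_t n, oldest = 0;

	if (pages == NULL)
		return (1);

	/* Fill memory + swap, touching each page so it is really backed. */
	for (n = 0; n < NPAGES; n++) {
		pages[n] = malloc(pgsz);
		if (pages[n] == NULL)
			break;
		memset(pages[n], 0xa5, pgsz);
	}
	if (n == 0)
		return (1);

	/* Tight loop: allocate one new page, then free one old page. */
	for (;;) {
		char *fresh = malloc(pgsz);

		if (fresh != NULL)
			memset(fresh, 0x5a, pgsz);
		free(pages[oldest]);
		pages[oldest] = fresh;
		oldest = (oldest + 1) % n;
	}
}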

How I expected a deadlock to happen: at some point, SWAPMETA tries to
allocate a new struct swblock, and _kmalloc_obj needs a new slab to
store it, and requests a new page. v_free_count is less than
v_free_reserved so the allocation fails, and no further progress can be
made.
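
Spelled out, the circular wait I'm worried about looks like this (my
own reading, not something I've observed; the intermediate steps are
approximate):

	pageout daemon wants to page something out
	  -> swp_pager_meta_build()        (record where the page went)
	    -> kmalloc_obj()               (needs a fresh slab page)
	      -> fails/blocks because v_free_count < v_free_reserved
	        -> waits for the pageout daemon to free pages, i.e. itself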

I don't think kmalloc_obj currently dips into the reserve: it calls
kmem_slab_alloc with flags = M_WAITOK. It appears kmem_slab_alloc will
dip into the reserve if td->td_preempted is nonzero, but I don't think
that will be true for the pageout daemon.
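
My rough reading of the relevant check, as paraphrased pseudocode (not
the actual source; names are from memory and may be off):

	/* somewhere in kmem_slab_alloc(), roughly: */
	if (vmstats.v_free_count < vmstats.v_free_reserved &&
	    td->td_preempted == 0) {
		if (flags & M_WAITOK)
			sleep until the pageout daemon frees pages;  /* can deadlock */
		else
			fail the allocation (return NULL);
	}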

Here are my best guesses as to why the deadlock didn't happen.

- The pageout daemon starts paging out well before vmstats.v_free_min is
  reached, as controlled by these numbers:

  	vmstats.v_paging_wait = vmstats.v_free_min * 2;
  	vmstats.v_paging_start = vmstats.v_free_min * 3;
  	vmstats.v_paging_target1 = vmstats.v_free_min * 4;
  	vmstats.v_paging_target2 = vmstats.v_free_min * 5;

  So, for a deadlock to actually occur, memory use would need to grow
  faster than the pageout daemon can reclaim pages.

  One way to tweak my stress test to get around this would be to fill
  up the swap, so there's nothing more the pageout daemon can do.
  But in this case it may also stop trying to allocate swblock structs,
  so maybe the deadlock can't happen then? I need to study the code
  more carefully to understand that.

- kmalloc_obj only rarely actually needs a new slab, because a slab can
  hold many swblock structs. This helps reduce the chance of a
  deadlock: most swblock allocations can be satisfied without needing
  any new pages. (A trivial sanity check for this is sketched after
  this list.)

- Even if the number of free pages got down to v_free_min, there are
  still ways for new pages to free up. For example, a process might
  terminate. I think it is also possible for a page-out I/O operation
  that was already in progress to complete, allowing a page to move
  from the inactive queue to the cache queue.
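
The sanity check mentioned in the slab bullet above: something I could
drop into swap_pager_swap_init() to see how big a swblock really is,
and hence how many fit in whatever slab size kmalloc_obj ends up using
(a hypothetical debug aid, nothing more):

	kprintf("struct swblock is %lu bytes\n",
		(unsigned long)sizeof(struct swblock));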

I am considering switching my attention to the MAP ENTRY subsystem,
because the deadlock problem there appears to be much more clear-cut:
vm_map_entry_reserve calls zalloc which needs to avoid calling
vm_map_entry_reserve to avoid a loop. So, I expect naïvely switching to
kmalloc_obj will deadlock almost immediately, giving me a starting
point for fixing the problem.
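
The loop I have in mind there looks roughly like this (my own reading
of the code; the intermediate calls are approximate):

	vm_map_entry_reserve()
	  -> zalloc()  [or, naively, kmalloc_obj()]
	    -> needs new kernel memory, i.e. a new mapping
	      -> creating that mapping needs a vm_map_entry
	        -> vm_map_entry_reserve()   <-- back where we started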

Here's the diff I tried (applied at commit 577b958f5e).

diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
index 045842664e..743a2ff6f5 100644
--- a/sys/vm/swap_pager.c
+++ b/sys/vm/swap_pager.c
@@ -116,7 +116,6 @@
 #include <vm/vm_pageout.h>
 #include <vm/swap_pager.h>
 #include <vm/vm_extern.h>
-#include <vm/vm_zone.h>
 #include <vm/vnode_pager.h>
 
 #include <sys/buf2.h>
@@ -210,7 +209,8 @@ SYSCTL_INT(_vm, OID_AUTO, swap_size,
 SYSCTL_INT(_vm, OID_AUTO, report_swap_allocs,
         CTLFLAG_RW, &vm_report_swap_allocs, 0, "");
 
-__read_mostly vm_zone_t	swap_zone;
+MALLOC_DEFINE_OBJ(M_SWAP_SWBLOCK, sizeof(struct swblock),
+		  "swblock", "swblock structures");
 
 /*
  * Red-Black tree for swblock entries
@@ -392,7 +392,12 @@ SYSINIT(vm_mem, SI_BOOT1_VM, SI_ORDER_THIRD, swap_pager_init, NULL);
 void
 swap_pager_swap_init(void)
 {
+	/*
+	 * TODO: This is where custom initialization might go.
+	 */
+#if 0
 	int n, n2;
+#endif
 
 	/*
 	 * Number of in-transit swap bp operations.  Don't
@@ -424,6 +429,10 @@ swap_pager_swap_init(void)
 	nsw_wcount_async = 4;
 	nsw_wcount_async_max = nsw_wcount_async;
 
+	/*
+	 * TODO: This is where custom initialization might go.
+	 */
+#if 0
 	/*
 	 * The zone is dynamically allocated so generally size it to
 	 * maxswzone (32MB to 256GB of KVM).  Set a minimum size based
@@ -454,8 +463,10 @@ swap_pager_swap_init(void)
 
 	if (swap_zone == NULL)
 		panic("swap_pager_swap_init: swap_zone == NULL");
+
 	if (n2 != n)
 		kprintf("Swap zone entries reduced from %d to %d.\n", n2, n);
+#endif
 }
 
 /*
@@ -2358,7 +2369,12 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t index, swblk_t swapblk)
 	if (swap == NULL) {
 		int i;
 
-		swap = zalloc(swap_zone);
+		/*
+		 * TODO: Deal with possible deadlock, and make sure flags make
+		 * sense.
+		 */
+		swap = kmalloc_obj(sizeof *swap, M_SWAP_SWBLOCK,
+				   M_NULLOK | M_INTWAIT);
 		if (swap == NULL) {
 			vm_wait(0);
 			goto retry;
@@ -2473,21 +2489,17 @@ swp_pager_meta_free_callback(struct swblock *swap, void *data)
 	/*
 	 * Scan and free the blocks.  The loop terminates early
 	 * if (swap) runs out of blocks and could be freed.
-	 *
-	 * NOTE: Decrement swb_count after swp_pager_freeswapspace()
-	 *	 to deal with a zfree race.
 	 */
 	while (index <= eindex) {
 		swblk_t v = swap->swb_pages[index];
 
 		if (v != SWAPBLK_NONE) {
 			swap->swb_pages[index] = SWAPBLK_NONE;
-			/* can block */
 			swp_pager_freeswapspace(object, v, 1);
 			--mycpu->gd_vmtotal.t_vm;
 			if (--swap->swb_count == 0) {
 				swp_pager_remove(object, swap);
-				zfree(swap_zone, swap);
+				kfree_obj(swap, M_SWAP_SWBLOCK);
 				--object->swblock_count;
 				break;
 			}
@@ -2495,7 +2507,7 @@ swp_pager_meta_free_callback(struct swblock *swap, void *data)
 		++index;
 	}
 
-	/* swap may be invalid here due to zfree above */
+	/* swap may be invalid here due to kfree_obj above */
 	lwkt_yield();
 
 	return(0);
@@ -2533,7 +2545,7 @@ swp_pager_meta_free_all(vm_object_t object)
 		}
 		if (swap->swb_count != 0)
 			panic("swap_pager_meta_free_all: swb_count != 0");
-		zfree(swap_zone, swap);
+		kfree_obj(swap, M_SWAP_SWBLOCK);
 		--object->swblock_count;
 		lwkt_yield();
 	}
@@ -2584,7 +2596,7 @@ swp_pager_meta_ctl(vm_object_t object, vm_pindex_t index, int flags)
 				--mycpu->gd_vmtotal.t_vm;
 				if (--swap->swb_count == 0) {
 					swp_pager_remove(object, swap);
-					zfree(swap_zone, swap);
+					kfree_obj(swap, M_SWAP_SWBLOCK);
 					--object->swblock_count;
 				}
 			} 

-- 
James


