author    Linus Torvalds <torvalds@linux-foundation.org>  Sun, 16 Dec 2012 22:33:25 +0000 (14:33 -0800)
committer Linus Torvalds <torvalds@linux-foundation.org>  Sun, 16 Dec 2012 23:18:08 +0000 (15:33 -0800)
Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
"There are three implementations for NUMA balancing, this tree
(balancenuma), numacore which has been developed in tip/master and
autonuma which is in aa.git.
In almost all respects balancenuma is the dumbest of the three because
its main impact is on the VM side with no attempt to be smart about
scheduling. In the interest of getting the ball rolling, it would be
desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.
The most recent comparisons available from different people are:
mel: https://lkml.org/lkml/2012/12/9/108
mingo: https://lkml.org/lkml/2012/12/7/331
tglx: https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397
The results are a mixed bag. In my own tests, balancenuma does
reasonably well. It's dumb as rocks and does not regress against
mainline. On the other hand, Ingo's tests show that balancenuma is
incapable of converging for the workloads driven by perf, which is bad
but is potentially explained by the lack of scheduler smarts. Thomas'
results show balancenuma improves on mainline but falls far short of
numacore or autonuma. Srikar's results indicate we all suffer on a
large machine with imbalanced node sizes.
My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.
We've butted heads heavily on system CPU usage and high levels of
migration even when it shows that overall performance is better.
There are also cases where it regresses. Of interest is that for
specjbb in some configurations it will regress for lower numbers of
warehouses and show gains for higher numbers which is not reported by
the tool by default and is sometimes missed in reports. Recently I
reported for numacore that the JVM was crashing with
NullPointerExceptions but currently it's unclear what the source of
this problem is. Initially I thought it was in how numacore batch-
handles PTEs but I no longer think this is the case. It's possible
numacore is just able to trigger it due to higher rates of migration.
These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks."
* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
mm/rmap: Convert the struct anon_vma::mutex to an rwsem
mm: migrate: Account a transhuge page properly when rate limiting
mm: numa: Account for failed allocations and isolations as migration failures
mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
mm: numa: Add THP migration for the NUMA working set scanning fault case.
mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
mm: sched: numa: Control enabling and disabling of NUMA balancing
mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
mm: numa: migrate: Set last_nid on newly allocated page
mm: numa: split_huge_page: Transfer last_nid on tail page
mm: numa: Introduce last_nid to the page frame
sched: numa: Slowly increase the scanning period as NUMA faults are handled
mm: numa: Rate limit setting of pte_numa if node is saturated
mm: numa: Rate limit the amount of memory that is migrated between nodes
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
mm: numa: Migrate pages handled during a pmd_numa hinting fault
mm: numa: Migrate on reference policy
...
"There are three implementations for NUMA balancing, this tree
(balancenuma), numacore which has been developed in tip/master and
autonuma which is in aa.git.
In almost all respects balancenuma is the dumbest of the three because
its main impact is on the VM side with no attempt to be smart about
scheduling. In the interest of getting the ball rolling, it would be
desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.
The most recent set of comparisons available from different people are
mel: https://lkml.org/lkml/2012/12/9/108
mingo: https://lkml.org/lkml/2012/12/7/331
tglx: https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397
The results are a mixed bag. In my own tests, balancenuma does
reasonably well. It's dumb as rocks and does not regress against
mainline. On the other hand, Ingo's tests shows that balancenuma is
incapable of converging for this workloads driven by perf which is bad
but is potentially explained by the lack of scheduler smarts. Thomas'
results show balancenuma improves on mainline but falls far short of
numacore or autonuma. Srikar's results indicate we all suffer on a
large machine with imbalanced node sizes.
My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.
We've butted heads heavily on system CPU usage and high levels of
migration even when it shows that overall performance is better.
There are also cases where it regresses. Of interest is that for
specjbb in some configurations it will regress for lower numbers of
warehouses and show gains for higher numbers which is not reported by
the tool by default and sometimes missed in treports. Recently I
reported for numacore that the JVM was crashing with
NullPointerExceptions but currently it's unclear what the source of
this problem is. Initially I thought it was in how numacore batch
handles PTEs but I'm no longer think this is the case. It's possible
numacore is just able to trigger it due to higher rates of migration.
These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks."
* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
mm/rmap: Convert the struct anon_vma::mutex to an rwsem
mm: migrate: Account a transhuge page properly when rate limiting
mm: numa: Account for failed allocations and isolations as migration failures
mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
mm: numa: Add THP migration for the NUMA working set scanning fault case.
mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
mm: sched: numa: Control enabling and disabling of NUMA balancing
mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
mm: numa: migrate: Set last_nid on newly allocated page
mm: numa: split_huge_page: Transfer last_nid on tail page
mm: numa: Introduce last_nid to the page frame
sched: numa: Slowly increase the scanning period as NUMA faults are handled
mm: numa: Rate limit setting of pte_numa if node is saturated
mm: numa: Rate limit the amount of memory that is migrated between nodes
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
mm: numa: Migrate pages handled during a pmd_numa hinting fault
mm: numa: Migrate on reference policy
...
37 files changed:
diff --cc Documentation/kernel-parameters.txt
Simple merge
diff --cc arch/x86/Kconfig
Simple merge
diff --cc arch/x86/mm/pgtable.c
Simple merge
diff --cc include/asm-generic/pgtable.h
Simple merge
diff --cc include/linux/huge_mm.h
Simple merge
diff --cc include/linux/hugetlb.h
Simple merge
diff --cc include/linux/mempolicy.h
Simple merge
diff --cc include/linux/migrate.h
index 0b5865c61efdf704d3e0768fa0fa4910bf7a4308,d52afb9a790ccb96cfb5096350c41acc22a1af8f..1e9f627967a3b91dbf28e4fc1fbfed484820a2cd
--- 1/include/linux/migrate.h
--- 2/include/linux/migrate.h
+++ b/include/linux/migrate.h
typedef struct page *new_page_t(struct page *, unsigned long private, int **);
+/*
+ * Return values from addresss_space_operations.migratepage():
+ * - negative errno on page migration failure;
+ * - zero on page migration success;
+ *
+ * The balloon page migration introduces this special case where a 'distinct'
+ * return code is used to flag a successful page migration to unmap_and_move().
+ * This approach is necessary because page migration can race against balloon
+ * deflation procedure, and for such case we could introduce a nasty page leak
+ * if a successfully migrated balloon page gets released concurrently with
+ * migration's unmap_and_move() wrap-up steps.
+ */
+#define MIGRATEPAGE_SUCCESS 0
+#define MIGRATEPAGE_BALLOON_SUCCESS 1 /* special ret code for balloon page
+ * sucessful migration case.
+ */
+ enum migrate_reason {
+ MR_COMPACTION,
+ MR_MEMORY_FAILURE,
+ MR_MEMORY_HOTPLUG,
+ MR_SYSCALL, /* also applies to cpusets */
+ MR_MEMPOLICY_MBIND,
+ MR_NUMA_MISPLACED,
+ MR_CMA
+ };
#ifdef CONFIG_MIGRATION
#else
static inline void putback_lru_pages(struct list_head *l) {}
+static inline void putback_movable_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- enum migrate_mode mode) { return -ENOSYS; }
+ enum migrate_mode mode, int reason) { return -ENOSYS; }
static inline int migrate_huge_page(struct page *page, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
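[Editor's note] The comment block added to include/linux/migrate.h above distinguishes three outcomes of a migratepage() callback: a negative errno on failure, MIGRATEPAGE_SUCCESS, and the balloon-only MIGRATEPAGE_BALLOON_SUCCESS that prevents a page leak when migration races with balloon deflation. A minimal, userspace-only sketch of dispatching on those codes follows; the helpers (do_migratepage(), release_balloon_page(), putback_page()) are hypothetical stand-ins and do not reflect the real unmap_and_move() logic.

/*
 * Hypothetical sketch: interpreting migratepage()-style return codes.
 * All helpers are stand-ins, not kernel APIs.
 */
#include <errno.h>
#include <stdio.h>

#define MIGRATEPAGE_SUCCESS          0
#define MIGRATEPAGE_BALLOON_SUCCESS  1	/* balloon page migrated; needs special put */

static int do_migratepage(int simulated_rc)	{ return simulated_rc; }
static void release_balloon_page(void)		{ puts("balloon page released safely"); }
static void putback_page(void)			{ puts("migration failed, page put back"); }

static int handle_migration(int simulated_rc)
{
	int rc = do_migratepage(simulated_rc);

	if (rc < 0) {
		/* negative errno: migration failed, restore the page */
		putback_page();
		return rc;
	}

	if (rc == MIGRATEPAGE_BALLOON_SUCCESS) {
		/*
		 * Migration succeeded, but the page belongs to a balloon
		 * device; treating it like a normal LRU page could leak it
		 * if it races with balloon deflation, hence the distinct code.
		 */
		release_balloon_page();
		return MIGRATEPAGE_SUCCESS;
	}

	return rc;	/* MIGRATEPAGE_SUCCESS */
}

int main(void)
{
	handle_migration(MIGRATEPAGE_SUCCESS);
	handle_migration(MIGRATEPAGE_BALLOON_SUCCESS);
	handle_migration(-ENOMEM);
	return 0;
}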
diff --cc include/linux/mm.h
Simple merge
diff --cc include/linux/mm_types.h
Simple merge
diff --cc include/linux/mmzone.h
Simple merge
diff --cc include/linux/sched.h
Simple merge
diff --cc include/linux/vm_event_item.h
Simple merge
diff --cc init/Kconfig
index 2054e048bb9844700346db4912cd33f7c92cca3d,18e2a5920a34288a18b150adc46aaa952704aac9..1a207efca5918d8ba97a8f9abffcc65527f3da2d
--- 1/init/Kconfig
--- 2/init/Kconfig
+++ b/init/Kconfig
config HAVE_UNSTABLE_SCHED_CLOCK
bool
- default y
+ #
+ # For architectures that want to enable the support for NUMA-affine scheduler
+ # balancing logic:
+ #
+ config ARCH_SUPPORTS_NUMA_BALANCING
+ bool
+
+ # For architectures that (ab)use NUMA to represent different memory regions
+ # all cpu-local but of different latencies, such as SuperH.
+ #
+ config ARCH_WANT_NUMA_VARIABLE_LOCALITY
+ bool
+
+ #
+ # For architectures that are willing to define _PAGE_NUMA as _PAGE_PROTNONE
+ config ARCH_WANTS_PROT_NUMA_PROT_NONE
+ bool
+
+ config ARCH_USES_NUMA_PROT_NONE
+ bool
+ default y
+ depends on ARCH_WANTS_PROT_NUMA_PROT_NONE
+ depends on NUMA_BALANCING
+
+ config NUMA_BALANCING_DEFAULT_ENABLED
+ bool "Automatically enable NUMA aware memory/task placement"
+ default y
+ depends on NUMA_BALANCING
+ help
+ If set, autonumic NUMA balancing will be enabled if running on a NUMA
+ machine.
+
+ config NUMA_BALANCING
+ bool "Memory placement aware NUMA scheduler"
+ depends on ARCH_SUPPORTS_NUMA_BALANCING
+ depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
+ depends on SMP && NUMA && MIGRATION
+ help
+ This option adds support for automatic NUMA aware memory/task placement.
+ The mechanism is quite primitive and is based on migrating memory when
+ it is references to the node the task is running on.
+
+ This system will be inactive on UMA systems.
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
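[Editor's note] The NUMA_BALANCING help text above describes the basic policy: when a page is referenced from a node other than the one it currently resides on, it becomes a candidate for migration toward the referencing task, subject to rate limiting. A rough, self-contained sketch of that decision is below; every helper (page_node(), task_node(), migrate_rate_limited(), numa_hinting_fault()) is a made-up stand-in, and the real logic lives in the NUMA hinting fault paths touched by this merge.

/*
 * Hypothetical sketch of the "migrate on reference" idea behind
 * NUMA_BALANCING.  Userspace-only; not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

static int page_node(int page)		{ return page % 2; }	/* pretend placement */
static int task_node(void)		{ return 0; }		/* node the task runs on */
static bool migrate_rate_limited(void)	{ return false; }	/* cf. rate-limit commits above */

/* Invoked from a (simulated) NUMA hinting fault on 'page'. */
static bool numa_hinting_fault(int page)
{
	int src = page_node(page);
	int dst = task_node();

	if (src == dst)
		return false;		/* already local to the task, leave it */

	if (migrate_rate_limited())
		return false;		/* target node saturated with migrations */

	printf("migrate page %d: node %d -> node %d\n", page, src, dst);
	return true;			/* pull the page toward the referencing task */
}

int main(void)
{
	numa_hinting_fault(3);		/* remote page: candidate for migration */
	numa_hinting_fault(4);		/* local page: no action */
	return 0;
}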
diff --cc kernel/fork.c
Simple merge
diff --cc kernel/sched/core.c
Simple merge
diff --cc kernel/sched/fair.c
index 756f9f9e85422759e4c391ac1e5404d93ee9a836,3e18f611a5aa6d15e41c2e32186da7587e386ef7..9af5af979a1344ec8ddd4bab9f392238e8c00670
--- 1/kernel/sched/fair.c
--- 2/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
entity_tick(cfs_rq, se, queued);
}
+ if (sched_feat_numa(NUMA))
+ task_tick_numa(rq, curr);
++
+ update_rq_runnable_avg(rq, 1);
}
/*
diff --cc kernel/sched/features.h
Simple merge
diff --cc kernel/sched/sched.h
Simple merge
diff --cc kernel/sysctl.c
Simple merge
diff --cc mm/compaction.c
Simple merge
diff --cc mm/huge_memory.c
index 827d9c81305115d4d3b0b4c22dfadf2fc20d2a10,a24c9cb9c83eb7c82b0c6aabd28eeefacb7cfb75..d7ee1691fd21038a87cbfeffbff56897deb2fd4b
--- 1/mm/huge_memory.c
--- 2/mm/huge_memory.c
+++ b/mm/huge_memory.c
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
+ #include <linux/migrate.h>
+
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
if (__pmd_trans_huge_lock(pmd, vma) == 1) {
pmd_t entry;
entry = pmdp_get_and_clear(mm, addr, pmd);
- entry = pmd_modify(entry, newprot);
+ if (!prot_numa)
+ entry = pmd_modify(entry, newprot);
+ else {
+ struct page *page = pmd_page(*pmd);
+
+ /* only check non-shared pages */
+ if (page_mapcount(page) == 1 &&
+ !pmd_numa(*pmd)) {
+ entry = pmd_mknuma(entry);
+ }
+ }
+ BUG_ON(pmd_write(entry));
set_pmd_at(mm, addr, pmd, entry);
spin_unlock(&vma->vm_mm->page_table_lock);
ret = 1;
struct anon_vma *anon_vma;
int ret = 1;
+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
BUG_ON(!PageAnon(page));
- anon_vma = page_lock_anon_vma(page);
+ anon_vma = page_lock_anon_vma_read(page);
if (!anon_vma)
goto out;
ret = 0;
hend = vma->vm_end & HPAGE_PMD_MASK;
if (address < hstart || address + HPAGE_PMD_SIZE > hend)
goto out;
-
- if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
- (vma->vm_flags & VM_NOHUGEPAGE))
- goto out;
-
- if (!vma->anon_vma || vma->vm_ops)
- goto out;
- if (is_vma_temporary_stack(vma))
+ if (!hugepage_vma_check(vma))
goto out;
- VM_BUG_ON(vma->vm_flags & VM_NO_THP);
-
- pgd = pgd_offset(mm, address);
- if (!pgd_present(*pgd))
- goto out;
-
- pud = pud_offset(pgd, address);
- if (!pud_present(*pud))
+ pmd = mm_find_pmd(mm, address);
+ if (!pmd)
goto out;
-
- pmd = pmd_offset(pud, address);
- /* pmd can't go away or become huge under us */
- if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+ if (pmd_trans_huge(*pmd))
goto out;
- anon_vma_lock(vma->anon_vma);
+ anon_vma_lock_write(vma->anon_vma);
pte = pte_offset_map(pmd, address);
ptl = pte_lockptr(mm, pmd);
diff --cc mm/hugetlb.c
Simple merge
diff --cc mm/internal.h
Simple merge
diff --cc mm/ksm.c
Simple merge
diff --cc mm/memcontrol.c
Simple merge
diff --cc mm/memory-failure.c
Simple merge
diff --cc mm/memory.c
index db2e9e797a05fc67684ef735ff5623b7151ad478,39edb11b63dc0bc3ae64f7682fae7afc67df6f54..e6a3b933517e0442c22aa89ab885ac6f2f95ae5b
--- 1/mm/memory.c
--- 2/mm/memory.c
+++ b/mm/memory.c
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+ goto no_page_table;
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd(vma, address, pmd);
goto split_fallthrough;
}
spin_lock(&mm->page_table_lock);
barrier();
if (pmd_trans_huge(orig_pmd)) {
- if (pmd_numa(*pmd))
+ unsigned int dirty = flags & FAULT_FLAG_WRITE;
+
- if (dirty && !pmd_write(orig_pmd) &&
- !pmd_trans_splitting(orig_pmd)) {
++ if (pmd_numa(orig_pmd))
+ return do_huge_pmd_numa_page(mm, vma, address,
+ orig_pmd, pmd);
+
- if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
++ if (dirty && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
/*
if (unlikely(ret & VM_FAULT_OOM))
goto retry;
return ret;
+ } else {
+ huge_pmd_set_accessed(mm, vma, address, pmd,
+ orig_pmd, dirty);
}
+
return 0;
}
}
diff --cc mm/memory_hotplug.c
Simple merge
diff --cc mm/mempolicy.c
Simple merge
diff --cc mm/migrate.c
index cae02711181dd7ad0bde55e9eecfc0dd5bccaa83,6e46485f014c8a206d8ca5d425ed00cfb0060d52..32efd8028bc9742ba5595ee57ff8e2b8722f9c91
--- 1/mm/migrate.c
--- 2/mm/migrate.c
+++ b/mm/migrate.c
case -EAGAIN:
retry++;
break;
- case 0:
+ case MIGRATEPAGE_SUCCESS:
+ nr_succeeded++;
break;
default:
/* Permanent failure */
}
}
}
- rc = 0;
+ rc = nr_failed + retry;
out:
+ if (nr_succeeded)
+ count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+ if (nr_failed)
+ count_vm_events(PGMIGRATE_FAIL, nr_failed);
+ trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;
diff --cc mm/mmap.c
Simple merge
diff --cc mm/mprotect.c
index e8c3938db6faecf6f91c47db790a36271fa8d938,dce6fb48edc4612e090702a4d58d123946eacf95..3dca970367db26585388353c41fc29df36ab450d
--- 1/mm/mprotect.c
--- 2/mm/mprotect.c
+++ b/mm/mprotect.c
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
- else if (change_huge_pmd(vma, pmd, addr, newprot))
+ else if (change_huge_pmd(vma, pmd, addr, newprot, prot_numa)) {
+ pages += HPAGE_PMD_NR;
continue;
+ }
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
diff --cc mm/mremap.c
Simple merge
diff --cc mm/page_alloc.c
index 83637dfba110c8308570d7f75ec46bca7386dc3a,73f226a1206e6b132273cbcb0b6f8a0bc5a68c60..d037c8bc15123cabf0d1c0c0043e028e59a64686
--- 1/mm/page_alloc.c
--- 2/mm/page_alloc.c
+++ b/mm/page_alloc.c
ret = migrate_pages(&cc->migratepages,
alloc_migrate_target,
- 0, false, MIGRATE_SYNC);
+ 0, false, MIGRATE_SYNC,
+ MR_CMA);
}
- putback_lru_pages(&cc->migratepages);
+ putback_movable_pages(&cc->migratepages);
return ret > 0 ? 0 : ret;
}
diff --cc mm/rmap.c
Simple merge
diff --cc mm/vmstat.c
Simple merge