aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* NFSv4.x: Fix handling of partially delegated locksTrond Myklebust2013-05-111-2/+2
| | | | | | | | | | | | | | | | | | commit c5a2a15f8146fdfe45078df7873a6dc1006b3869 upstream. If a NFS client receives a delegation for a file after it has taken a lock on that file, we can currently end up in a situation where we mistakenly skip unlocking that file. The following patch swaps an erroneous check in nfs4_proc_unlck for whether or not the file has a delegation to one which checks whether or not we hold a lock stateid for that file. Reported-by: Chuck Lever <Chuck.Lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Chuck Lever <Chuck.Lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix extent logging with O_DIRECT into preallocJosef Bacik2013-05-111-8/+13
| | | | | | | | | | | | | | | | | | commit eb384b55ae9c2055ea00c5cc87971e182d47aefa upstream. This is the same as the fix from commit Btrfs: fix bad extent logging but for O_DIRECT. I missed this when I fixed the problem originally, we were still using the em for the orig_start and orig_block_len, which would be the merged extent. We need to use the actual extent from the on disk file extent item, which we have to lookup to make sure it's ok to nocow anyway so just pass in some pointers to hold this info. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: compare relevant parts of delayed tree refsJosef Bacik2013-05-111-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 41b0fc42800569f63e029549b75c4c9cb63f2dfd upstream. A user reported a panic while running a balance. What was happening was he was relocating a block, which added the reference to the relocation tree. Then relocation would walk through the relocation tree and drop that reference and free that block, and then it would walk down a snapshot which referenced the same block and add another ref to the block. The problem is this was all happening in the same transaction, so the parent block was free'ed up when we drop our reference which was immediately available for allocation, and then it was used _again_ to add a reference for the same block from a different snapshot. This resulted in something like this in the delayed ref tree add ref to 90234880, parent=2067398656, ref_root 1766, level 1 del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1 add ref to 90234880, parent=2067398656, ref_root 1767, level 1 as you can see the ref_root's don't match, because when we inc the ref we use the header owner, which is the original tree the block belonged to, instead of the data reloc tree. Then when we remove the extent we use the reloc tree objectid. But none of this matters, since it is a shared reference which means only the parent matters. When the delayed ref stuff runs it adds all the increments first, and then does all the drops, to make sure that we don't delete the ref if we net a positive ref count. But tree blocks aren't allowed to have multiple refs from the same block, so this panics when it tries to add the second ref. We need the add and the drop to cancel each other out in memory so we only do the final add. So to fix this we need to adjust how the delayed refs are added to the tree. Only the ref_root matters when it is a normal backref, and only the parent matters when it is a shared backref. So make our decision based on what ref type we have. This allows us to keep the ref_root in memory in case anybody wants to use it for something else, and it allows the delayed refs to be merged properly so we don't end up with this panic. With this patch the users image no longer panics on mount, and it has a clean fsck after a normal mount/umount cycle. Thanks, Reported-by: Roman Mamedov <rm@romanrm.ru> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: add check for inodes_count overflow in new resize ioctlTheodore Ts'o2013-05-111-0/+4
| | | | | | | | | | | | | commit 3f8a6411fbada1fa482276591e037f3b1adcf55b upstream. Addresses-Red-Hat-Bugzilla: #913245 Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Lingzhu Xiang <lxiang@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hugetlbfs: fix mmap failure in unaligned size requestNaoya Horiguchi2013-05-111-14/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit af73e4d9506d3b797509f3c030e7dcd554f7d9c4 upstream. The current kernel returns -EINVAL unless a given mmap length is "almost" hugepage aligned. This is because in sys_mmap_pgoff() the given length is passed to vm_mmap_pgoff() as it is without being aligned with hugepage boundary. This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix alignment of huge page requests"), where alignment code is pushed into hugetlb_file_setup() and the variable len in caller side is not changed. To fix this, this patch partially reverts that commit, and adds alignment code in caller side. And it also introduces hstate_sizelog() in order to get proper hstate to specified hugepage size. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881 [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: <iceman_dvd@yahoo.com> Cc: Steven Truelove <steven.truelove@utoronto.ca> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* autofs - remove autofs dentry mount checkDavid Jeffery2013-05-111-9/+0
| | | | | | | | | | | | | | | | | | | | | | commit ce8a5dbdf9e709bdaf4618d7ef8cceb91e8adc69 upstream. When checking if an autofs mount point is busy it isn't sufficient to only check if it's a mount point. For example, if the mount of an offset mountpoint in a tree is denied for this host by its export and the dentry becomes a process working directory the check incorrectly returns the mount as not in use at expire. This can happen since the default when mounting within a tree is nostrict, which means ingnore mount fails on mounts within the tree and continue. The nostrict option is meant to allow mounting in this case. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix Kconfig documentation for CONFIG_EXT4_DEBUGTheodore Ts'o2013-05-071-1/+2
| | | | | | | | | | | | commit 7f3e3c7cfcec148ccca9c0dd2dbfd7b00b7ac10f upstream. Fox the Kconfig documentation for CONFIG_EXT4_DEBUG to match the change made by commit a0b30c1229: ext4: use module parameters instead of debugfs for mballoc_debug Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix online resizing for ext3-compat file systemsTheodore Ts'o2013-05-071-0/+2
| | | | | | | | | | | | | | commit c5c72d814cf0f650010337c73638b25e6d14d2d4 upstream. Commit fb0a387dcdc restricts block allocations for indirect-mapped files to block groups less than s_blockfile_groups. However, the online resizing code wasn't setting s_blockfile_groups, so the newly added block groups were not available for non-extent mapped files. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix big-endian bug in metadata checksum calculationsDmitry Monakhov2013-05-072-5/+5
| | | | | | | | | commit 171a7f21a76a0958c225b97c00a97a10390d40ee upstream. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix journal callback list traversalDmitry Monakhov2013-05-073-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5d3ee20855e28169d711b394857ee608a5023094 upstream. It is incorrect to use list_for_each_entry_safe() for journal callback traversial because ->next may be removed by other task: ->ext4_mb_free_metadata() ->ext4_mb_free_metadata() ->ext4_journal_callback_del() This results in the following issue: WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250() Hardware name: list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107 Call Trace: [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80 [<ffffffff810ac6be>] kthread+0x10e/0x120 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 This patch fix the issue as follows: - ext4_journal_commit_callback() make list truly traversial safe simply by always starting from list_head - fix race between two ext4_journal_callback_del() and ext4_journal_callback_try_del() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callbackDmitry Monakhov2013-05-071-22/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 794446c6946513c684d448205fbd76fa35f38b72 upstream. The following race is possible: [kjournald2] other_task jbd2_journal_commit_transaction() j_state = T_FINISHED; spin_unlock(&journal->j_list_lock); ->jbd2_journal_remove_checkpoint() ->jbd2_journal_free_transaction(); ->kmem_cache_free(transaction) ->j_commit_callback(journal, transaction); -> USE_AFTER_FREE WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250() Hardware name: list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107 Call Trace: [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80 [<ffffffff810ac6be>] kthread+0x10e/0x120 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 In order to demonstrace this issue one should mount ext4 with mount -o discard option on SSD disk. This makes callback longer and race window becomes wider. In order to fix this we should mark transaction as finished only after callbacks have completed Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4/jbd2: don't wait (forever) for stale tid caused by wraparoundTheodore Ts'o2013-05-073-4/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d76a3a77113db020d9bb1e894822869410450bd9 upstream. In the case where an inode has a very stale transaction id (tid) in i_datasync_tid or i_sync_tid, it's possible that after a very large (2**31) number of transactions, that the tid number space might wrap, causing tid_geq()'s calculations to fail. Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily", attempted to fix this problem, but it only avoided kjournald spinning forever by fixing the logic in jbd2_log_start_commit(). Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c that might call jbd2_log_start_commit() with a stale tid, those functions will subsequently call jbd2_log_wait_commit() with the same stale tid, and then wait for a very long time. To fix this, we replace the calls to jbd2_log_start_commit() and jbd2_log_wait_commit() with a call to a new function, jbd2_complete_transaction(), which will correctly handle stale tid's. As a bonus, jbd2_complete_transaction() will avoid locking j_state_lock for writing unless a commit needs to be started. This should have a small (but probably not measurable) improvement for ext4's scalability. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: George Barnett <gbarnett@atlassian.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nfsd: Decode and send 64bit time valuesBryan Schumaker2013-05-071-14/+5
| | | | | | | | | | | | | | commit bf8d909705e9d9bac31d9b8eac6734d2b51332a7 upstream. The seconds field of an nfstime4 structure is 64bit, but we are assuming that the first 32bits are zero-filled. So if the client tries to set atime to a value before the epoch (touch -t 196001010101), then the server will save the wrong value on disk. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nfsd: don't run get_file if nfs4_preprocess_stateid_op return errorfanchaoting2013-05-071-4/+4
| | | | | | | | | | | | commit b022032e195ffca83d7002d6b84297d796ed443b upstream. we should return error status directly when nfs4_preprocess_stateid_op return error. Signed-off-by: fanchaoting <fanchaoting@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nfsd4: don't close read-write opens too soonJ. Bruce Fields2013-05-071-7/+1
| | | | | | | | | | | | | | | commit 0c7c3e67ab91ec6caa44bdf1fc89a48012ceb0c5 upstream. Don't actually close any opens until we don't need them at all. This means being left with write access when it's not really necessary, but that's better than putting a file that might still have posix locks held on it, as we have been. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* NFSv4: Handle NFS4ERR_DELAY and NFS4ERR_GRACE in nfs4_open_delegation_recallTrond Myklebust2013-05-071-0/+6
| | | | | | | | | | | | | commit 8b6cc4d6f841d31f72fe7478453759166d366274 upstream. A server shouldn't normally return NFS4ERR_GRACE if the client holds a delegation, since no conflicting lock reclaims can be granted, however the spec does not require the server to grant the open in this instance Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* LOCKD: Ensure that nlmclnt_block resets block->b_status after a server rebootTrond Myklebust2013-05-072-3/+3
| | | | | | | | | | | | | | | | | | commit 1dfd89af8697a299e7982ae740d4695ecd917eef upstream. After a server reboot, the reclaimer thread will recover all the existing locks. For locks that are blocked, however, it will change the value of block->b_status to nlm_lck_denied_grace_period in order to signal that they need to wake up and resend the original blocking lock request. Due to a bug, however, the block->b_status never gets reset after the blocked locks have been woken up, and so the process goes into an infinite loop of resends until the blocked lock is satisfied. Reported-by: Marc Eshel <eshel@us.ibm.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* exec: do not abuse ->cred_guard_mutex in threadgroup_lock()Oleg Nesterov2013-05-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e56fb2874015370e3b7f8d85051f6dce26051df9 upstream. threadgroup_lock() takes signal->cred_guard_mutex to ensure that thread_group_leader() is stable. This doesn't look nice, the scope of this lock in do_execve() is huge. And as Dave pointed out this can lead to deadlock, we have the following dependencies: do_execve: cred_guard_mutex -> i_mutex cgroup_mount: i_mutex -> cgroup_mutex attach_task_by_pid: cgroup_mutex -> cred_guard_mutex Change de_thread() to take threadgroup_change_begin() around the switch-the-leader code and change threadgroup_lock() to avoid ->cred_guard_mutex. Note that de_thread() can't sleep with ->group_rwsem held, this can obviously deadlock with the exiting leader if the writer is active, so it does threadgroup_change_end() before schedule(). Reported-by: Dave Jones <davej@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs/dcache.c: add cond_resched() to shrink_dcache_parent()Greg Thelen2013-05-071-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 421348f1ca0bf17769dee0aed4d991845ae0536d upstream. Call cond_resched() in shrink_dcache_parent() to maintain interactivity. Before this patch: void shrink_dcache_parent(struct dentry * parent) { while ((found = select_parent(parent, &dispose)) != 0) shrink_dentry_list(&dispose); } select_parent() populates the dispose list with dentries which shrink_dentry_list() then deletes. select_parent() carefully uses need_resched() to avoid doing too much work at once. But neither shrink_dcache_parent() nor its called functions call cond_resched(). So once need_resched() is set select_parent() will return single dentry dispose list which is then deleted by shrink_dentry_list(). This is inefficient when there are a lot of dentry to process. This can cause softlockup and hurts interactivity on non preemptable kernels. This change adds cond_resched() in shrink_dcache_parent(). The benefit of this is that need_resched() is quickly cleared so that future calls to select_parent() are able to efficiently return a big batch of dentry. These additional cond_resched() do not seem to impact performance, at least for the workload below. Here is a program which can cause soft lockup if other system activity sets need_resched(). int main() { struct rlimit rlim; int i; int f[100000]; char buf[20]; struct timeval t1, t2; double diff; /* cleanup past run */ system("rm -rf x"); /* boost nfile rlimit */ rlim.rlim_cur = 200000; rlim.rlim_max = 200000; if (setrlimit(RLIMIT_NOFILE, &rlim)) err(1, "setrlimit"); /* make directory for files */ if (mkdir("x", 0700)) err(1, "mkdir"); if (gettimeofday(&t1, NULL)) err(1, "gettimeofday"); /* populate directory with open files */ for (i = 0; i < 100000; i++) { snprintf(buf, sizeof(buf), "x/%d", i); f[i] = open(buf, O_CREAT); if (f[i] == -1) err(1, "open"); } /* close some of the files */ for (i = 0; i < 85000; i++) close(f[i]); /* unlink all files, even open ones */ system("rm -rf x"); if (gettimeofday(&t2, NULL)) err(1, "gettimeofday"); diff = (((double)t2.tv_sec * 1000000 + t2.tv_usec) - ((double)t1.tv_sec * 1000000 + t1.tv_usec)); printf("done: %g elapsed\n", diff/1e6); return 0; } Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* inotify: invalid mask should return a error number but not set itZhao Hongjiang2013-05-071-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 04df32fa10ab9a6f0643db2949d42efc966bc844 upstream. When we run the crackerjack testsuite, the inotify_add_watch test is stalled. This is caused by the invalid mask 0 - the task is waiting for the event but it never comes. inotify_add_watch() should return -EINVAL as it did before commit 676a0675cf92 ("inotify: remove broken mask checks causing unmount to be EINVAL"). That commit removes the invalid mask check, but that check is needed. Check the mask's ALL_INOTIFY_BITS before the inotify_arg_to_mask() call. If none are set, just return -EINVAL. Because IN_UNMOUNT is in ALL_INOTIFY_BITS, this change will not trigger the problem that above commit fixed. [akpm@linux-foundation.org: fix build] Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Acked-by: Jim Somerville <Jim.Somerville@windriver.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: allow arch code to control the user page table ceilingHugh Dickins2013-05-071-2/+2
| | | | | | | | | | | | | | | | | | | | commit 6ee8630e02be6dd89926ca0fbc21af68b23dc087 upstream. On architectures where a pgd entry may be shared between user and kernel (e.g. ARM+LPAE), freeing page tables needs a ceiling other than 0. This patch introduces a generic USER_PGTABLES_CEILING that arch code can override. It is the responsibility of the arch code setting the ceiling to ensure the complete freeing of the page tables (usually in pgd_free()). [catalin.marinas@arm.com: commit log; shift_arg_pages(), asm-generic/pgtables.h changes] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs/fscache/stats.c: fix memory leakAnurup m2013-05-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | commit ec686c9239b4d472052a271c505d04dae84214cc upstream. There is a kernel memory leak observed when the proc file /proc/fs/fscache/stats is read. The reason is that in fscache_stats_open, single_open is called and the respective release function is not called during release. Hence fix with correct release function - single_release(). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=57101 Signed-off-by: Anurup m <anurup.m@huawei.com> Cc: shyju pv <shyju.pv@huawei.com> Cc: Sanil kumar <sanil.kumar@huawei.com> Cc: Nataraj m <nataraj.m@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sysfs: fix use after free in case of concurrent read/write and readdirMing Lei2013-05-071-4/+11
| | | | | | | | | | | | | | | | | | | | | commit f7db5e7660b122142410dcf36ba903c73d473250 upstream. The inode->i_mutex isn't hold when updating filp->f_pos in read()/write(), so the filp->f_pos might be read as 0 or 1 in readdir() when there is concurrent read()/write() on this same file, then may cause use after free in readdir(). The bug can be reproduced with Li Zefan's test code on the link: https://patchwork.kernel.org/patch/2160771/ This patch fixes the use after free under this situation. Reported-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* aio: fix possible invalid memory access when DEBUG is enabledZhao Hongjiang2013-05-011-1/+1
| | | | | | | | | | | commit 91d80a84bbc8f28375cca7e65ec666577b4209ad upstream. dprintk() shouldn't access @ring after it's unmapped. Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: make sure nbytes are right after log replayJosef Bacik2013-04-251-6/+42
| | | | | | | | | | | | | | | | | | | | | commit 4bc4bee4595662d8bff92180d5c32e3313a704b0 upstream. While trying to track down a tree log replay bug I noticed that fsck was always complaining about nbytes not being right for our fsynced file. That is because the new fsync stuff doesn't wait for ordered extents to complete, so the inodes nbytes are not necessarily updated properly when we log it. So to fix this we need to set nbytes to whatever it is on the inode that is on disk, so when we replay the extents we can just add the bytes that are being added as we replay the extent. This makes it work for the case that we have the wrong nbytes or the case that we logged everything and nbytes is actually correct. With this I'm no longer getting nbytes errors out of btrfsck. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Lingzhu Xiang <lxiang@redhat.com> Reviewed-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hfsplus: fix potential overflow in hfsplus_file_truncate()Vyacheslav Dubeyko2013-04-251-1/+1
| | | | | | | | | | | | | | | commit 12f267a20aecf8b84a2a9069b9011f1661c779b4 upstream. Change a u32 to loff_t hfsplus_file_truncate(). Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs/binfmt_elf.c: fix hugetlb memory check in vma_dump_size()Naoya Horiguchi2013-04-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | commit 23d9e482136e31c9d287633a6e473daa172767c4 upstream. Documentation/filesystems/proc.txt says about coredump_filter bitmask, Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only effected by bit 5-6. However current code can go into the subsequent flag checks of bit 0-4 for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work as written in the document. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)Naoya Horiguchi2013-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | commit a2fce9143057f4eb7675a21cca1b6beabe585c8b upstream. Currently we fail to include any data on hugepages into coredump, because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently introduced by commit 314e51b9851b ("mm: kill vma flag VM_RESERVED and mm->reserved_vm counter"). This looks to me a serious regression, so let's fix it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kthread: Prevent unpark race which puts threads on the wrong cpuThomas Gleixner2013-04-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f2530dc71cf0822f90bb63ea4600caaef33a66bb upstream. The smpboot threads rely on the park/unpark mechanism which binds per cpu threads on a particular core. Though the functionality is racy: CPU0 CPU1 CPU2 unpark(T) wake_up_process(T) clear(SHOULD_PARK) T runs leave parkme() due to !SHOULD_PARK bind_to(CPU2) BUG_ON(wrong CPU) We cannot let the tasks move themself to the target CPU as one of those tasks is actually the migration thread itself, which requires that it starts running on the target cpu right away. The solution to this problem is to prevent wakeups in park mode which are not from unpark(). That way we can guarantee that the association of the task to the target cpu is working correctly. Add a new task state (TASK_PARKED) which prevents other wakeups and use this state explicitly for the unpark wakeup. Peter noticed: Also, since the task state is visible to userspace and all the parked tasks are still in the PID space, its a good hint in ps and friends that these tasks aren't really there for the moment. The migration thread has another related issue. CPU0 CPU1 Bring up CPU2 create_thread(T) park(T) wait_for_completion() parkme() complete() sched_set_stop_task() schedule(TASK_PARKED) The sched_set_stop_task() call is issued while the task is on the runqueue of CPU1 and that confuses the hell out of the stop_task class on that cpu. So we need the same synchronizaion before sched_set_stop_task(). Reported-by: Dave Jones <davej@redhat.com> Reported-and-tested-by: Dave Hansen <dave@sr71.net> Reported-and-tested-by: Borislav Petkov <bp@alien8.de> Acked-by: Peter Ziljstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: dhillf@gmail.com Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vfs: Revert spurious fix to spinning prevention in prune_icache_sbSuleiman Souhlal2013-04-161-1/+1
| | | | | | | | | | | | | | | | | | | | commit 5b55d708335a9e3e4f61f2dadf7511502205ccd1 upstream. Revert commit 62a3ddef6181 ("vfs: fix spinning prevention in prune_icache_sb"). This commit doesn't look right: since we are looking at the tail of the list (sb->s_inode_lru.prev) if we want to skip an inode, we should put it back at the head of the list instead of the tail, otherwise we will keep spinning on it. Discovered when investigating why prune_icache_sb came top in perf reports of a swapping load. Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cifs: Allow passwords which begin with a delimitorSachin Prabhu2013-04-161-3/+13
| | | | | | | | | | | | | | commit c369c9a4a7c82d33329d869cbaf93304cc7a0c40 upstream. Fixes a regression in cifs_parse_mount_options where a password which begins with a delimitor is parsed incorrectly as being a blank password. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* GFS2: return error if malloc failed in gfs2_rs_alloc()Wei Yongjun2013-04-161-1/+1
| | | | | | | | | | | | | | commit 441362d06be349430d06e37286adce4b90e6ce96 upstream. The error code in gfs2_rs_alloc() is set to ENOMEM when error but never be used, instead, gfs2_rs_alloc() always return 0. Fix to return 'error'. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* GFS2: Fix unlock of fcntl locks during withdrawn stateSteven Whitehouse2013-04-161-1/+4
| | | | | | | | | | | | | | commit c2952d202f710d326ac36a8ea6bd216b20615ec8 upstream. When withdraw occurs, we need to continue to allow unlocks of fcntl locks to occur, however these will only be local, since the node has withdrawn from the cluster. This prevents triggering a VFS level bug trap due to locks remaining when a file is closed. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* NFSv4: Doh! Typo in the fix to nfs41_walk_client_listTrond Myklebust2013-04-121-0/+1
| | | | | | | | | | | | commit eb04e0ac198cec3bab407ad220438dfa65c19c67 upstream. Make sure that we set the status to 0 on success. Missed in testing because it never appears when doing multiple mounts to _different_ servers. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_listTrond Myklebust2013-04-121-16/+28
| | | | | | | | | | | | | | | | | | | | | commit 7b1f1fd1842e6ede25183c267ae733a7f67f00bc upstream. It is unsafe to use list_for_each_entry_safe() here, because when we drop the nn->nfs_client_lock, we pin the _current_ list entry and ensure that it stays in the list, but we don't do the same for the _next_ list entry. Use of list_for_each_entry() is therefore the correct thing to do. Also fix the refcounting in nfs41_walk_client_list(). Finally, ensure that the nfs_client has finished being initialised and, in the case of NFSv4.1, that the session is set up. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* NFSv4: Fix a memory leak in nfs4_discover_server_trunkingTrond Myklebust2013-04-121-1/+7
| | | | | | | | | | | | commit b193d59a4863ea670872d76dc99231ddeb598625 upstream. When we assign a new rpc_client to clp->cl_rpcclient, we need to destroy the old one. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* reiserfs: Fix warning and inode leak when deleting inode with xattrsJan Kara2013-04-121-2/+2
| | | | | | | | | | | | | | | | | | commit 35e5cbc0af240778e61113286c019837e06aeec6 upstream. After commit 21d8a15a (lookup_one_len: don't accept . and ..) reiserfs started failing to delete xattrs from inode. This was due to a buggy test for '.' and '..' in fill_with_dentries() which resulted in passing '.' and '..' entries to lookup_one_len() in some cases. That returned error and so we failed to iterate over all xattrs of and inode. Fix the test in fill_with_dentries() along the lines of the one in lookup_one_len(). Reported-by: Pawel Zawora <pzawora@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* UBIFS: make space fixup work in the remount caseArtem Bityutskiy2013-04-121-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 67e753ca41782913d805ff4a8a2b0f60b26b7915 upstream. The UBIFS space fixup is a useful feature which allows to fixup the "broken" flash space at the time of the first mount. The "broken" space is usually the result of using a "dumb" industrial flasher which is not able to skip empty NAND pages and just writes all 0xFFs to the empty space, which has grave side-effects for UBIFS when UBIFS trise to write useful data to those empty pages. The fix-up feature works roughly like this: 1. mkfs.ubifs sets the fixup flag in UBIFS superblock when creating the image (see -F option) 2. when the file-system is mounted for the first time, UBIFS notices the fixup flag and re-writes the entire media atomically, which may take really a lot of time. 3. UBIFS clears the fixup flag in the superblock. This works fine when the file system is mounted R/W for the very first time. But it did not really work in the case when we first mount the file-system R/O, and then re-mount R/W. The reason was that we started the fixup procedure too late, which we cannot really do because we have to fixup the space before it starts being used. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reported-by: Mark Jackson <mpfj-list@mimc.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix space leak when we fail to reserve metadata spaceJosef Bacik2013-04-051-6/+41
| | | | | | | | | | | | | | | | | | | | | commit f4881bc7a83eff263789dd524b7c269d138d4af5 upstream. Dave reported a warning when running xfstest 275. We have been leaking delalloc metadata space when our reservations fail. This is because we were improperly calculating how much space to free for our checksum reservations. The problem is we would sometimes free up space that had already been freed in another thread and we would end up with negative usage for the delalloc space. This patch fixes the problem by calculating how much space the other threads would have already freed, and then calculate how much space we need to free had we not done the reservation at all, and then freeing any excess space. This makes xfstests 275 no longer have leaked space. Thanks Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Lingzhu Xiang <lxiang@redhat.com> Reviewed-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nfsd4: reject "negative" acl lengthsJ. Bruce Fields2013-04-051-1/+1
| | | | | | | | | | | | | | | commit 64a817cfbded8674f345d1117b117f942a351a69 upstream. Since we only enforce an upper bound, not a lower bound, a "negative" length can get through here. The symptom seen was a warning when we attempt to a kmalloc with an excessive size. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* loop: prevent bdev freeing while device in useAnatol Pomozov2013-04-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c1681bf8a7b1b98edee8b862a42c19c4e53205fd upstream. struct block_device lifecycle is defined by its inode (see fs/block_dev.c) - block_device allocated first time we access /dev/loopXX and deallocated on bdev_destroy_inode. When we create the device "losetup /dev/loopXX afile" we want that block_device stay alive until we destroy the loop device with "losetup -d". But because we do not hold /dev/loopXX inode its counter goes 0, and inode/bdev can be destroyed at any moment. Usually it happens at memory pressure or when user drops inode cache (like in the test below). When later in loop_clr_fd() we want to use bdev we have use-after-free error with following stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000280 bd_set_size+0x10/0xa0 loop_clr_fd+0x1f8/0x420 [loop] lo_ioctl+0x200/0x7e0 [loop] lo_compat_ioctl+0x47/0xe0 [loop] compat_blkdev_ioctl+0x341/0x1290 do_filp_open+0x42/0xa0 compat_sys_ioctl+0xc1/0xf20 do_sys_open+0x16e/0x1d0 sysenter_dispatch+0x7/0x1a To prevent use-after-free we need to grab the device in loop_set_fd() and put it later in loop_clr_fd(). The issue is reprodusible on current Linus head and v3.3. Here is the test: dd if=/dev/zero of=loop.file bs=1M count=1 while [ true ]; do losetup /dev/loop0 loop.file echo 2 > /proc/sys/vm/drop_caches losetup -d /dev/loop0 done [ Doing bdgrab/bput in loop_set_fd/loop_clr_fd is safe, because every time we call loop_set_fd() we check that loop_device->lo_state is Lo_unbound and set it to Lo_bound If somebody will try to set_fd again it will get EBUSY. And if we try to loop_clr_fd() on unbound loop device we'll get ENXIO. loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under loop_device->lo_ctl_mutex. ] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: don't drop path when printing out tree errors in scrubJosef Bacik2013-04-051-1/+2
| | | | | | | | | | | | | | | commit d8fe29e9dea8d7d61fd140d8779326856478fc62 upstream. A user reported a panic where we were panicing somewhere in tree_backref_for_extent from scrub_print_warning. He only captured the trace but looking at scrub_print_warning we drop the path right before we mess with the extent buffer to print out a bunch of stuff, which isn't right. So fix this by dropping the path after we use the eb if we need to. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: limit the global reserve to 512mbJosef Bacik2013-04-051-1/+1
| | | | | | | | | | | | | | | | | commit fdf30d1c1b386e1b73116cc7e0fb14e962b763b0 upstream. A user reported a problem where he was getting early ENOSPC with hundreds of gigs of free data space and 6 gigs of free metadata space. This is because the global block reserve was taking up the entire free metadata space. This is ridiculous, we have infrastructure in place to throttle if we start using too much of the global reserve, so instead of letting it get this huge just limit it to 512mb so that users can still get work done. This allowed the user to complete his rsync without issues. Thanks Reported-and-tested-by: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix race between mmap writes and compressionChris Mason2013-04-053-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | commit 4adaa611020fa6ac65b0ac8db78276af4ec04e63 upstream. Btrfs uses page_mkwrite to ensure stable pages during crc calculations and mmap workloads. We call clear_page_dirty_for_io before we do any crcs, and this forces any application with the file mapped to wait for the crc to finish before it is allowed to change the file. With compression on, the clear_page_dirty_for_io step is happening after we've compressed the pages. This means the applications might be changing the pages while we are compressing them, and some of those modifications might not hit the disk. This commit adds the clear_page_dirty_for_io before compression starts and makes sure to redirty the page if we have to fallback to uncompressed IO as well. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix locking on ROOT_REPLACE operations in tree mod logJan Schmidt2013-04-051-10/+20
| | | | | | | | | | | | | | | | | | | | | | commit d9abbf1c3131b679379762700201ae69367f3f62 upstream. To resolve backrefs, ROOT_REPLACE operations in the tree mod log are required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation. Therefore, those operations must be enclosed by tree_mod_log_write_lock() and tree_mod_log_write_unlock() calls. Those calls are private to the tree_mod_log_* functions, which means that removal of the elements of an old root node must be logged from tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5 (Btrfs: fix a tree mod logging issue for root replacement operations). This fixes the brand-new version of xfstest 276 as of commit cfe73f71. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: use set_nlink if our i_nlink is 0Josef Bacik2013-04-051-1/+4
| | | | | | | | | | | | | | commit 9bf7a4890518186238d2579be16ecc5190a707c0 upstream. We need to inc the nlink of deleted entries when running replay so we can do the unlink on the fs_root and get everything cleaned up and then have the orphan cleanup do the right thing. The problem is inc_nlink complains about this, even thought it still does the right thing. So use set_nlink() if our i_nlink is 0 to keep users from seeing the warnings during log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: Restrict when proc and sysfs can be mountedEric W. Biederman2013-04-053-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | commit 87a8ebd637dafc255070f503909a053cf0d98d3f upstream. Only allow unprivileged mounts of proc and sysfs if they are already mounted when the user namespace is created. proc and sysfs are interesting because they have content that is per namespace, and so fresh mounts are needed when new namespaces are created while at the same time proc and sysfs have content that is shared between every instance. Respect the policy of who may see the shared content of proc and sysfs by only allowing new mounts if there was an existing mount at the time the user namespace was created. In practice there are only two interesting cases: proc and sysfs are mounted at their usual places, proc and sysfs are not mounted at all (some form of mount namespace jail). Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vfs: Carefully propogate mounts across user namespacesEric W. Biederman2013-04-053-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | commit 132c94e31b8bca8ea921f9f96a57d684fa4ae0a9 upstream. As a matter of policy MNT_READONLY should not be changable if the original mounter had more privileges than creator of the mount namespace. Add the flag CL_UNPRIVILEGED to note when we are copying a mount from a mount namespace that requires more privileges to a mount namespace that requires fewer privileges. When the CL_UNPRIVILEGED flag is set cause clone_mnt to set MNT_NO_REMOUNT if any of the mnt flags that should never be changed are set. This protects both mount propagation and the initial creation of a less privileged mount namespace. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vfs: Add a mount flag to lock read only bind mountsEric W. Biederman2013-04-051-0/+3
| | | | | | | | | | | | | | | | | commit 90563b198e4c6674c63672fae1923da467215f45 upstream. When a read-only bind mount is copied from mount namespace in a higher privileged user namespace to a mount namespace in a lesser privileged user namespace, it should not be possible to remove the the read-only restriction. Add a MNT_LOCK_READONLY mount flag to indicate that a mount must remain read-only. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: Don't allow creation if the user is chrootedEric W. Biederman2013-04-051-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3151527ee007b73a0ebd296010f1c0454a919c7d upstream. Guarantee that the policy of which files may be access that is established by setting the root directory will not be violated by user namespaces by verifying that the root directory points to the root of the mount namespace at the time of user namespace creation. Changing the root is a privileged operation, and as a matter of policy it serves to limit unprivileged processes to files below the current root directory. For reasons of simplicity and comprehensibility the privilege to change the root directory is gated solely on the CAP_SYS_CHROOT capability in the user namespace. Therefore when creating a user namespace we must ensure that the policy of which files may be access can not be violated by changing the root directory. Anyone who runs a processes in a chroot and would like to use user namespace can setup the same view of filesystems with a mount namespace instead. With this result that this is not a practical limitation for using user namespaces. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>