git.baikalelectronics.ru Git

mm: do_sync_mapping_range integrity fix

Chris Mason notices do_sync_mapping_range didn't actually ask for data
integrity writeout. Unfortunately, it is advertised as being usable for
data integrity operations.

This is a data integrity bug.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: write_cache_pages more terminate quickly

Now that we have the early-termination logic in place, it makes sense to
bail out early in all other cases where done is set to 1.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: write_cache_pages terminate quickly

Terminate the write_cache_pages loop upon encountering the first page past
end, without locking the page. Pages cannot have their index change when
we have a reference on them (truncate, eg truncate_inode_pages_range
performs the same check without the page lock).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: write_cache_pages optimise page cleaning

In write_cache_pages, if we get stuck behind another process that is
cleaning pages, we will be forced to wait for them to finish, then perform
our own writeout (if it was redirtied during the long wait), then wait for
that.

If a page under writeout is still clean, we can skip waiting for it (if
we're part of a data integrity sync, we'll be waiting for all writeout
pages afterwards, so we'll still be waiting for the other guy's write
that's cleaned the page).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: write_cache_pages cleanups

Get rid of some complex expressions from flow control statements, add a
comment, remove some duplicate code.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: write_cache_pages integrity fix

In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
so the function will return success after writing out nr_to_write pages,
even if that was not sufficient to guarantee data integrity.

The callers tend to set it to values that could break data interity
semantics easily in practice.  For example, nr_to_write can be set to
mapping->nr_pages * 2, however if a file has a single, dirty page, then
fsync is called, subsequent pages might be concurrently added and dirtied,
then write_cache_pages might writeout two of these newly dirty pages,
while not writing out the old page that should have been written out.

Fix this by ignoring nr_to_write if it is a data integrity sync.

This is a data integrity bug.

The reason this has been done in the past is to avoid stalling sync
operations behind page dirtiers.

"If a file has one dirty page at offset 1000000000000000 then someone
  does an fsync() and someone else gets in first and starts madly writing
  pages at offset 0, we want to write that page at 1000000000000000.
  Somehow."

What we do today is return success after an arbitrary amount of pages are
written, whether or not we have provided the data-integrity semantics that
the caller has asked for.  Even this doesn't actually fix all stall cases
completely: in the above situation, if the file has a huge number of pages
in pagecache (but not dirty), then mapping->nrpages is going to be huge,
even if pages are being dirtied.

This change does indeed make the possibility of long stalls lager, and
that's not a good thing, but lying about data integrity is even worse.  We
have to either perform the sync, or return -ELINUXISLAME so at least the
caller knows what has happened.

There are subsequent competing approaches in the works to solve the stall
problems properly, without compromising data integrity.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: write_cache_pages writepage error fix

In write_cache_pages, if ret signals a real error, but we still have some
pages left in the pagevec, done would be set to 1, but the remaining pages
would continue to be processed and ret will be overwritten in the process.

It could easily be overwritten with success, and thus success will be
returned even if there is an error. Thus the caller is told all writes
succeeded, wheras in reality some did not.

Fix this by bailing immediately if there is an error, and retaining the
first error code.

This is a data integrity bug.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: write_cache_pages early loop termination

We'd like to break out of the loop early in many situations, however the
existing code has been setting mapping->writeback_index past the final
page in the pagevec lookup for cyclic writeback.  This is a problem if we
don't process all pages up to the final page.

Currently the code mostly keeps writeback_index reasonable and hacked
around this by not breaking out of the loop or writing pages outside the
range in these cases.  Keep track of a real "done index" that enables us
to terminate the loop in a much more flexible manner.

Needed by the subsequent patch to preserve writepage errors, and then
further patches to break out of the loop early for other reasons.  However
there are no functional changes with this patch alone.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: write_cache_pages cyclic fix

In write_cache_pages, scanned == 1 is supposed to mean that cyclic
writeback has circled through zero, thus we should not circle again.
However it gets set to 1 after the first successful pagevec lookup.  This
leads to cases where not enough data gets written.

Counterexample: file with first 10 pages dirty, writeback_index == 5,
nr_to_write == 10.  Then the 5 last pages will be found, and scanned will
be set to 1, after writing those out, we will not cycle back to get the
first 5.

Rework this logic, now we'll always cycle unless we started off from index
0.  When cycling, only write out as far as 1 page before the start page
from the first cycle (so we don't write parts of the file twice).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

do_mpage_readpage(): don't submit lots of small bios on boundary

While tracing I/O patterns with blktrace (a great tool) a few weeks ago I
identified a minor issue in fs/mpage.c

As the comment above mpage_readpages() says, a fs's get_block function
will set BH_Boundary when it maps a block just before a block for which
extra I/O is required.

Since get_block() can map a range of pages, for all these pages the
BH_Boundary flag will be set. But we only need to push what I/O we have
accumulated at the last block of this range.

This makes do_mpage_readpage() send out the largest possible bio instead
of a bunch of page-sized ones in the BH_Boundary case.

Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

oom: print triggering task's cpuset and mems allowed

When cpusets are enabled, it's necessary to print the triggering task's
set of allowable nodes so the subsequently printed meminfo can be
interpreted correctly.

We also print the task's cpuset name for informational purposes.

[rientjes@google.com: task lock current before dereferencing cpuset]
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

oom: fix zone_scan_mutex name

zone_scan_mutex is actually a spinlock, so name it appropriately.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: invoke oom-killer from page fault

Rather than have the pagefault handler kill a process directly if it gets
a VM_FAULT_OOM, have it call into the OOM killer.

With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
oom killing throttling, oom priority adjustment or selective disabling,
panic on oom, etc), it's silly to unconditionally kill the faulting
process at page fault time. Create a hook for pagefault oom path to call
into instead.

Only converted x86 and uml so far.

[akpm@linux-foundation.org: make __out_of_memory() static]
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Dike <jdike@addtoit.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: move_pages: no need to set pp->page to ZERO_PAGE(0) by default

pp->page is never used when not set to the right page, so there is no need
to set it to ZERO_PAGE(0) by default.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: rework do_pages_move() to work on page_sized chunks

Rework do_pages_move() to work by page-sized chunks of struct page_to_node
that are passed to do_move_page_to_node_array(). We now only have to
allocate a single page instead a possibly very large vmalloc area to store
all page_to_node entries.

As a result, new_page_node() will now have a very small lookup, hidding
much of the overall sys_move_pages() overhead.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Nathalie Furmento <Nathalie.Furmento@labri.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: don't mark_page_accessed in shmem_fault

Following "mm: don't mark_page_accessed in fault path", which now
places a mark_page_accessed() in zap_pte_range(), we should remove
the mark_page_accessed() from shmem_fault().

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: don't mark_page_accessed in fault path

Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
unmap-time if the pte is young has a number of problems.

mark_page_accessed is supposed to be roughly the equivalent of a young pte
for unmapped references. Unfortunately it doesn't come with any context:
after being called, reclaim doesn't know who or why the page was touched.

So calling mark_page_accessed not only adds extra lru or PG_referenced
manipulations for pages that are already going to have pte_young ptes anyway,
but it also adds these references which are difficult to work with from the
context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
wish to contribute to the page being referenced).

Then, simply doing SetPageReferenced when zapping a pte and finding it is
young, is not a really good solution either. SetPageReferenced does not
correctly promote the page to the active list for example. So after removing
mark_page_accessed from the fault path, several mmap()+touch+munmap() would
have a very different result from several read(2) calls for example, which
is not really desirable.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: report the MMU pagesize in /proc/pid/smaps

The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
kernel to back a VMA.  This matches the size used by the MMU in the
majority of cases.  However, one counter-example occurs on PPC64 kernels
whereby a kernel using 64K as a base pagesize may still use 4K pages for
the MMU on older processor.  To distinguish, this patch reports
MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: report the pagesize backing a VMA in /proc/pid/smaps

It is useful to verify a hugepage-aware application is using the expected
pagesizes for its memory regions. This patch creates an entry called
KernelPageSize in /proc/pid/smaps that is the size of page used by the
kernel to back a VMA. The entry is not called PageSize as it is possible
the MMU uses a different size. This extension should not break any sensible
parser that skips lines containing unrecognised information.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm

* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
  dm snapshot: extend exception store functions
  dm snapshot: split out exception store implementations
  dm snapshot: rename struct exception_store
  dm snapshot: separate out exception store interface
  dm mpath: move trigger_event to system workqueue
  dm: add name and uuid to sysfs
  dm table: rework reference counting
  dm: support barriers on simple devices
  dm request: extend target interface
  dm request: add caches
  dm ioctl: allow dm_copy_name_and_uuid to return only one field
  dm log: ensure log bitmap fits on log device
  dm log: move region_size validation
  dm log: avoid reinitialising io_req on every operation
  dm: consolidate target deregistration error handling
  dm raid1: fix error count
  dm log: fix dm_io_client leak on error paths
  dm snapshot: change yield to msleep
  dm table: drop reference at unbind

dm snapshot: extend exception store functions

Supply dm_add_exception as a callback to the read_metadata function.
Add a status function ready for a later patch and name the functions
consistently.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm snapshot: split out exception store implementations

Move the existing snapshot exception store implementations out into
separate files. Later patches will place these behind a new
interface in preparation for alternative implementations.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm snapshot: rename struct exception_store

Rename struct exception_store to dm_exception_store.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm snapshot: separate out exception store interface

Pull structures that bridge the gap between snapshot and
exception store out of dm-snap.h and put them in a new
.h file - dm-exception-store.h. This file will define the
API for new exception stores.

Ultimately, dm-snap.h is unnecessary, since only dm-snap.c
should be using it.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm mpath: move trigger_event to system workqueue

The same workqueue is used both for sending uevents and processing queued I/O.
Deadlock has been reported in RHEL5 when sending a uevent was blocked waiting
for the queued I/O to be processed. Use scheduled_work() for the asynchronous
uevents instead.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: add name and uuid to sysfs

Implement simple read-only sysfs entry for device-mapper block device.

This patch adds a simple sysfs directory named "dm" under block device
properties and implements
- name attribute (string containing mapped device name)
- uuid attribute (string containing UUID, or empty string if not set)

The kobject is embedded in mapped_device struct, so no additional
memory allocation is needed for initializing sysfs entry.

During the processing of sysfs attribute we need to lock mapped device
which is done by a new function dm_get_from_kobj, which returns the md
associated with kobject and increases the usage count.

Each 'show attribute' function is responsible for its own locking.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm table: rework reference counting

Rework table reference counting.

The existing code uses a reference counter. When the last reference is
dropped and the counter reaches zero, the table destructor is called.
Table reference counters are acquired/released from upcalls from other
kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
If the reference counter reaches zero in one of the upcalls, the table
destructor is called from almost random kernel code.

This leads to various problems:
* dm_any_congested being called under a spinlock, which calls the
  destructor, which calls some sleeping function.
* the destructor attempting to take a lock that is already taken by the
  same process.
* stale reference from some other kernel code keeps the table
  constructed, which keeps some devices open, even after successful
  return from "dmsetup remove". This can confuse lvm and prevent closing
  of underlying devices or reusing device minor numbers.

The patch changes reference counting so that the table destructor can be
called only at predetermined places.

The table has always exactly one reference from either mapped_device->map
or hash_cell->new_map. After this patch, this reference is not counted
in table->holders.  A pair of dm_create_table/dm_destroy_table functions
is used for table creation/destruction.

Temporary references from the other code increase table->holders. A pair
of dm_table_get/dm_table_put functions is used to manipulate it.

When the table is about to be destroyed, we wait for table->holders to
reach 0. Then, we call the table destructor.  We use active waiting with
msleep(1), because the situation happens rarely (to one user in 5 years)
and removing the device isn't performance-critical task: the user doesn't
care if it takes one tick more or not.

This way, the destructor is called only at specific points
(dm_table_destroy function) and the above problems associated with lazy
destruction can't happen.

Finally remove the temporary protection added to dm_any_congested().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: support barriers on simple devices

Implement barrier support for single device DM devices

This patch implements barrier support in DM for the common case of dm linear
just remapping a single underlying device. In this case we can safely
pass the barrier through because there can be no reordering between
devices.

NB. Any DM device might cease to support barriers if it gets
reconfigured so code must continue to allow for a possible
-EOPNOTSUPP on every barrier bio submitted. - agk

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm request: extend target interface

This patch adds the following target interfaces for request-based dm.

  map_rq    : for mapping a request

  rq_end_io : for finishing a request

  busy      : for avoiding performance regression from bio-based dm.
              Target can tell dm core not to map requests now, and
              that may help requests in the block layer queue to be
              bigger by I/O merging.
              In bio-based dm, this behavior is done by device
              drivers managing the block layer queue.
              But in request-based dm, dm core has to do that
              since dm core manages the block layer queue.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm request: add caches

This patch prepares some kmem_caches for request-based dm.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm ioctl: allow dm_copy_name_and_uuid to return only one field

Allow NULL buffer in dm_copy_name_and_uuid if you only want to return one of
the fields.

(Required by a following patch that adds these fields to sysfs.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm log: ensure log bitmap fits on log device

Check that the log bitmap will fit within the log device.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm log: move region_size validation

Move log size validation from mirror target to log constructor.

Removed PAGE_SIZE restriction we no longer think necessary.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm log: avoid reinitialising io_req on every operation

rw_header function updates three members of io_req data every time
when I/O is processed. bi_rw and notify.fn are never modified once
they get initialized, and so they can be set in advance.

header_to_disk() can also be pulled out of write_header() since only one
caller needs it and write_header() can be replaced by rw_header()
directly.

Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: consolidate target deregistration error handling

Change dm_unregister_target to return void and use BUG() for error
reporting.

dm_unregister_target can only fail because of programming bug in the
target driver. It can't fail because of user's behavior or disk errors.

This patch changes unregister_target to return void and use BUG if
someone tries to unregister non-registered target or unregister target
that is in use.

This patch removes code duplication (testing of error codes in all dm
targets) and reports bugs in just one place, in dm_unregister_target. In
some target drivers, these return codes were ignored, which could lead
to a situation where bugs could be missed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm raid1: fix error count

Always increase the error count when I/O on a leg of a mirror fails.

The error count is used to decide whether to select an alternative
mirror leg. If the target doesn't use the "handle_errors" feature, the
error count is not updated and the bio can get requeued forever by the
read callback.

Fix it by increasing error_count before the handle_errors feature
checking.

Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm log: fix dm_io_client leak on error paths

In create_log_context function, dm_io_client_destroy function needs
to be called, when memory allocation of disk_header, sync_bits and
recovering_bits failed, but dm_io_client_destroy is not called.

Cc: stable@kernel.org
Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Acked-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm snapshot: change yield to msleep

Change yield() to msleep(1). If the thread had realtime priority,
yield() doesn't really yield, so the yielding process would loop
indefinitely and cause machine lockup.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm table: drop reference at unbind

Move one dm_table_put() so that the last reference in the thread
gets dropped in __unbind().

This is required for a following patch,
dm-table-rework-reference-counting.patch, which will change the logic in
such a way that table destructor is called only at specific points in
the code.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

Merge branch 'for-next' of git://git.o-hand.com/linux-mfd

* 'for-next' of git://git.o-hand.com/linux-mfd: (30 commits)
  mfd: Fix section mismatch in da903x
  mfd: move drivers/i2c/chips/menelaus.c to drivers/mfd
  mfd: move drivers/i2c/chips/tps65010.c to drivers/mfd
  mfd: dm355evm msp430 driver
  mfd: Add missing break from wm3850-core
  mfd: Add WM8351 support
  mfd: Support configurable numbers of DCDCs and ISINKs on WM8350
  mfd: Handle missing WM8350 platform data
  mfd: Add WM8352 support
  mfd: Use irq_to_desc in twl4030 code
  power_supply: Add Dialog DA9030 battery charger driver
  mfd: Dialog DA9030 battery charger MFD driver
  mfd: Register WM8400 codec device
  mfd: Pass driver_data onto child devices
  mfd: Fix twl4030-core.c build error
  mfd: twl4030 regulator bug fixes
  mfd: twl4030: create some regulator devices
  mfd: twl4030: cleanup symbols and OMAP dependency
  mfd: twl4030: simplified child creation code
  power_supply: Add battery health reporting for WM8350
  ...

Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus

* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  module: convert to stop_machine_create/destroy.
  stop_machine: introduce stop_machine_create/destroy.
  parisc: fix module loading failure of large kernel modules
  module: fix module loading failure of large kernel modules for parisc
  module: fix warning of unused function when !CONFIG_PROC_FS
  kernel/module.c: compare symbol values when marking symbols as exported in /proc/kallsyms.
  remove CONFIG_KMOD

Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  swiotlb: Don't include linux/swiotlb.h twice in lib/swiotlb.c
  intel-iommu: fix build error with INTR_REMAP=y and DMAR=n
  swiotlb: add missing __init annotations

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: fs/dlm/ast.c: fix warning
  dlm: add new debugfs entry
  dlm: add time stamp of blocking callback
  dlm: change lock time stamping
  dlm: improve how bast mode handling
  dlm: remove extra blocking callback check
  dlm: replace schedule with cond_resched
  dlm: remove kmap/kunmap
  dlm: trivial annotation of be16 value
  dlm: fix up memory allocation flags

Merge branch 'i2c-next' of git://aeryn.fluff.org.uk/bjdooks/linux

* 'i2c-next' of git://aeryn.fluff.org.uk/bjdooks/linux:
  i2c-omap: fix type of irq handler function
  i2c-s3c2410: Change IRQ to be plain integer.
  i2c-s3c2410: Allow more than one i2c-s3c2410 adapter
  i2c-s3c2410: Remove default platform data.
  i2c-s3c2410: Use platform data for gpio configuration
  i2c-s3c2410: Fixup style problems from checkpatch.pl
  i2c-omap: Enable I2C wakeups for 34xx
  i2c-omap: reprogram OCP_SYSCONFIG register after reset
  i2c-omap: convert 'rev1' flag to generic 'rev' u8
  i2c-omap: fix I2C timeouts due to recursive omap_i2c_{un,}idle()
  i2c-omap: Clean-up i2c-omap
  i2c-omap: Don't compile in OMAP15xx I2C ISR for non-OMAP15xx builds
  i2c-omap: Mark init-only functions as __init
  i2c-omap: Add support for omap34xx
  i2c-omap: FIFO handling support and broken hw workaround for i2c-omap
  i2c-omap: Add high-speed support to omap-i2c
  i2c-omap: Close suspected race between omap_i2c_idle() and omap_i2c_isr()
  i2c-omap: Do not use interruptible wait call in omap_i2c_xfer_msg

Fix up apparently-trivial conflict in drivers/i2c/busses/i2c-s3c2410.c

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (22 commits)
  HID: fix error condition propagation in hid-sony driver
  HID: fix reference count leak hidraw
  HID: add proper support for pensketch 12x9 tablet
  HID: don't allow DealExtreme usb-radio be handled by usb hid driver
  HID: fix default Kconfig setting for TopSpeed driver
  HID: driver for TopSeed Cyberlink quirky remote
  HID: make boot protocol drivers depend on EMBEDDED
  HID: avoid sparse warning in HID_COMPAT_LOAD_DRIVER
  HID: hiddev cleanup -- handle all error conditions properly
  HID: force feedback driver for GreenAsia 0x12 PID
  HID: switch specialized drivers from "default y" to !EMBEDDED
  HID: set proper dev.parent in hidraw
  HID: add dynids facility
  HID: use GFP_KERNEL in hid_alloc_buffers
  HID: usbhid, use usb_endpoint_xfer_int
  HID: move usbhid flags to usbhid.h
  HID: add n-trig digitizer support
  HID: add phys and name ioctls to hidraw
  HID: struct device - replace bus_id with dev_name(), dev_set_name()
  HID: automatically call usbhid_set_leds in usbhid driver
  ...

Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (27 commits)
  GFS2: Use DEFINE_SPINLOCK
  GFS2: Fix use-after-free bug on umount (try #2)
  Revert "GFS2: Fix use-after-free bug on umount"
  GFS2: Streamline alloc calculations for writes
  GFS2: Send useful information with uevent messages
  GFS2: Fix use-after-free bug on umount
  GFS2: Remove ancient, unused code
  GFS2: Move four functions from super.c
  GFS2: Fix bug in gfs2_lock_fs_check_clean()
  GFS2: Send some sensible sysfs stuff
  GFS2: Kill two daemons with one patch
  GFS2: Move gfs2_recoverd into recovery.c
  GFS2: Fix "truncate in progress" hang
  GFS2: Clean up & move gfs2_quotad
  GFS2: Add more detail to debugfs glock dumps
  GFS2: Banish struct gfs2_rgrpd_host
  GFS2: Move rg_free from gfs2_rgrpd_host to gfs2_rgrpd
  GFS2: Move rg_igeneration into struct gfs2_rgrpd
  GFS2: Banish struct gfs2_dinode_host
  GFS2: Move i_size from gfs2_dinode_host and rename it to i_disksize
  ...

igb: fix anoying type mismatch warning on rx/tx queue sizing

When using "min()", the types of both sides should match. With the cpu
mask changes, the type of num_online_cpus() will now depend on config
options. Use "min_t()" with an explicit type instead.

And make the rx/tx case look the same too, just for sanity.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6: (30 commits)
  sparc: Fix minor SPARC32 compile error
  sparc: Remove reg*.h from Kbuild
  sparc: Clean arch-specific code in prom_common.c
  sparc: Kill asm/reg*.h
  sparc: Use 64BIT config entry
  MAINTAINERS: update sparc maintainer
  sparc: unify ipcbuf.h
  sparc: Update 64-bit defconfig.
  sparc: remove NO_PROC_ID - it is no longer used
  sparc: drop get_tbr() in traps.h
  sparc: fix warning in userspace header traps.h
  sparc: fix warnings in userspace header byteorder.h
  sparc: fix warning in userspace header jsflash.h
  sparc: unify openprom.h
  sparc64: delete unused linux_prom64_ranges from openprom_64.h
  sparc: prepare openprom for unification
  sparc: remove linux_prom_pci_assigned_addresses from openprom_32.h
  sparc: remove ebus definitions from openprom*.h
  sparc: unify siginfo.h
  sparc: unify ptrace.h
  ...

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (44 commits)
  qlge: Fix sparse warnings for tx ring indexes.
  qlge: Fix sparse warning regarding rx buffer queues.
  qlge: Fix sparse endian warning in ql_hw_csum_setup().
  qlge: Fix sparse endian warning for inbound packet control block flags.
  qlge: Fix sparse warnings for byte swapping in qlge_ethool.c
  myri10ge: print MAC and serial number on probe failure
  pkt_sched: cls_u32: Fix locking in u32_change()
  iucv: fix cpu hotplug
  af_iucv: Free iucv path/socket in path_pending callback
  af_iucv: avoid left over IUCV connections from failing connects
  af_iucv: New error return codes for connect()
  net/ehea: bitops work on unsigned longs
  Revert "net: Fix for initial link state in 2.6.28"
  tcp: Kill extraneous SPLICE_F_NONBLOCK checks.
  tcp: don't mask EOF and socket errors on nonblocking splice receive
  dccp: Integrate the TFRC library with DCCP
  dccp: Clean up ccid.c after integration of CCID plugins
  dccp: Lockless integration of CCID congestion-control plugins
  qeth: get rid of extra argument after printk to dev_* conversion
  qeth: No large send using EDDP for HiperSockets.
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: ice1724 - Fix a typo in IEC958 PCM name
  ASoC: fix davinci-sffsdr buglet
  ALSA: sound/usb: Use negated usb_endpoint_xfer_control, etc
  ALSA: hda - cxt5051 report jack state
  ALSA: hda - add basic jack reporting functions to patch_conexant.c
  ALSA: Use usb_set/get_intfdata
  ASoC: Clean up kerneldoc warnings
  ASoC: Fix pxa2xx-pcm checks for invalid DMA channels
  LSA: hda - Add HP Acacia detection
  ALSA: hda - fix name for ALC1200
  ALSA: sound/usb: use USB API functions rather than constants
  ASoC: TWL4030: DAPM based capture implementation
  ASoC: TWL4030: Make the enum filter generic for twl4030

Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
  [CPUFREQ] Fix on resume, now preserves user policy min/max.
  [CPUFREQ] Add Celeron Core support to p4-clockmod.
  [CPUFREQ] add to speedstep-lib additional fsb values for core processors
  [CPUFREQ] Disable sysfs ui for p4-clockmod.
  [CPUFREQ] p4-clockmod: reduce noise
  [CPUFREQ] clean up speedstep-centrino and reduce cpumask_t usage

Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (138 commits)
  ocfs2: Access the right buffer_head in ocfs2_merge_rec_left.
  ocfs2: use min_t in ocfs2_quota_read()
  ocfs2: remove unneeded lvb casts
  ocfs2: Add xattr support checking in init_security
  ocfs2: alloc xattr bucket in ocfs2_xattr_set_handle
  ocfs2: calculate and reserve credits for xattr value in mknod
  ocfs2/xattr: fix credits calculation during index create
  ocfs2/xattr: Always updating ctime during xattr set.
  ocfs2/xattr: Remove extend_trans call and add its credits from the beginning
  ocfs2/dlm: Fix race during lockres mastery
  ocfs2/dlm: Fix race in adding/removing lockres' to/from the tracking list
  ocfs2/dlm: Hold off sending lockres drop ref message while lockres is migrating
  ocfs2/dlm: Clean up errors in dlm_proxy_ast_handler()
  ocfs2/dlm: Fix a race between migrate request and exit domain
  ocfs2: One more hamming code optimization.
  ocfs2: Another hamming code optimization.
  ocfs2: Don't hand-code xor in ocfs2_hamming_encode().
  ocfs2: Enable metadata checksums.
  ocfs2: Validate superblock with checksum and ecc.
  ocfs2: Checksum and ECC for directory blocks.
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  inotify: fix type errors in interfaces
  fix breakage in reiserfs_new_inode()
  fix the treatment of jfs special inodes
  vfs: remove duplicate code in get_fs_type()
  add a vfs_fsync helper
  sys_execve and sys_uselib do not call into fsnotify
  zero i_uid/i_gid on inode allocation
  inode->i_op is never NULL
  ntfs: don't NULL i_op
  isofs check for NULL ->i_op in root directory is dead code
  affs: do not zero ->i_op
  kill suid bit only for regular files
  vfs: lseek(fd, 0, SEEK_CUR) race condition

mm lockless pagecache barrier fix

An XFS workload showed up a bug in the lockless pagecache patch. Basically it
would go into an "infinite" loop, although it would sometimes be able to break
out of the loop! The reason is a missing compiler barrier in the "increment
reference count unless it was zero" case of the lockless pagecache protocol in
the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be detected if the page has
been reallocated. However it can result in a lockup.

Linus points out that ACCESS_ONCE is also required in that pointer load, even
if it's absence is not causing a bug on our particular build. The most general
way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.

Assembly of find_get_pages,
before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)

Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

i2o: Update my address

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

qlge: Fix sparse warnings for tx ring indexes.

Warnings:
drivers/net/qlge/qlge_main.c:1474:34: warning: restricted degrades to integer
drivers/net/qlge/qlge_main.c:1475:36: warning: restricted degrades to integer
drivers/net/qlge/qlge_main.c:1592:51: warning: restricted degrades to integer
drivers/net/qlge/qlge_main.c:1941:20: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:1941:20:    expected restricted unsigned int [usertype] tid
drivers/net/qlge/qlge_main.c:1941:20:    got int [signed] index
drivers/net/qlge/qlge_main.c:1945:24: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:1945:24:    expected restricted unsigned int [usertype] txq_idx
drivers/net/qlge/qlge_main.c:1945:24:    got unsigned int [unsigned] [usertype] tx_ring_idx

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlge: Fix sparse warning regarding rx buffer queues.

Warnings:
drivers/net/qlge/qlge_main.c:909:17: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:909:17:    expected unsigned int [unsigned] [usertype] addr_lo
drivers/net/qlge/qlge_main.c:909:17:    got restricted unsigned int [usertype] <noident>
drivers/net/qlge/qlge_main.c:911:17: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:911:17:    expected unsigned int [unsigned] [usertype] addr_hi
drivers/net/qlge/qlge_main.c:911:17:    got restricted unsigned int [usertype] <noident>
drivers/net/qlge/qlge_main.c:974:17: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:974:17:    expected unsigned int [unsigned] [usertype] addr_lo
drivers/net/qlge/qlge_main.c:974:17:    got restricted unsigned int [usertype] <noident>
drivers/net/qlge/qlge_main.c:975:17: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:975:17:    expected unsigned int [unsigned] [usertype] addr_hi
drivers/net/qlge/qlge_main.c:975:17:    got restricted unsigned int [usertype] <noident>
drivers/net/qlge/qlge_main.c:2132:16: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:2132:16:    expected unsigned int [unsigned] [usertype] addr_lo
drivers/net/qlge/qlge_main.c:2132:16:    got restricted unsigned int [usertype] <noident>
drivers/net/qlge/qlge_main.c:2133:16: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:2133:16:    expected unsigned int [unsigned] [usertype] addr_hi
drivers/net/qlge/qlge_main.c:2133:16:    got restricted unsigned int [usertype] <noident>
drivers/net/qlge/qlge_main.c:2212:15: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:2212:15:    expected unsigned int [unsigned] [usertype] addr_lo
drivers/net/qlge/qlge_main.c:2212:15:    got restricted unsigned int [usertype] <noident>
drivers/net/qlge/qlge_main.c:2214:15: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:2214:15:    expected unsigned int [unsigned] [usertype] addr_hi
drivers/net/qlge/qlge_main.c:2214:15:    got restricted unsigned int [usertype] <noident>

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlge: Fix sparse endian warning in ql_hw_csum_setup().

Changed u16 to __sum16 usage.

Warnings:
drivers/net/qlge/qlge_main.c:1897:9: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:1897:9:    expected unsigned short [usertype] *check
drivers/net/qlge/qlge_main.c:1897:9:    got restricted unsigned short *<noident>
drivers/net/qlge/qlge_main.c:1903:9: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:1903:9:    expected unsigned short [usertype] *check
drivers/net/qlge/qlge_main.c:1903:9:    got restricted unsigned short *<noident>
drivers/net/qlge/qlge_main.c:1909:9: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_main.c:1909:9:    expected unsigned short [unsigned] [short] [usertype] <noident>

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlge: Fix sparse endian warning for inbound packet control block flags.

Changed flags element from __le32 to 3 reserved bytes and one byte of
flags. Changed flags bit definitions to reflect byte width instead of
__le32 width.

Warnings:
drivers/net/qlge/qlge_main.c:1206:16: warning: restricted degrades to integer
drivers/net/qlge/qlge_main.c:1207:16: warning: restricted degrades to integer
drivers/net/qlge/qlge_main.c:1233:17: warning: restricted degrades to integer
drivers/net/qlge/qlge_main.c:1276:17: warning: restricted degrades to integer
drivers/net/qlge/qlge_main.c:1349:19: warning: restricted degrades to integer

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlge: Fix sparse warnings for byte swapping in qlge_ethool.c

drivers/net/qlge/qlge_ethtool.c:59:23: warning: cast to restricted type
drivers/net/qlge/qlge_ethtool.c:59:21: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_ethtool.c:59:21:    expected restricted unsigned short [usertype] irq_delay
drivers/net/qlge/qlge_ethtool.c:59:21:    got unsigned short [unsigned] [usertype] <noident>
drivers/net/qlge/qlge_ethtool.c:61:8: warning: cast to restricted type
drivers/net/qlge/qlge_ethtool.c:60:21: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_ethtool.c:60:21:    expected restricted unsigned short [usertype] pkt_delay
drivers/net/qlge/qlge_ethtool.c:60:21:    got unsigned short [unsigned] [usertype] <noident>
drivers/net/qlge/qlge_ethtool.c:82:23: warning: cast to restricted type
drivers/net/qlge/qlge_ethtool.c:82:21: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_ethtool.c:82:21:    expected restricted unsigned short [usertype] irq_delay
drivers/net/qlge/qlge_ethtool.c:82:21:    got unsigned short [unsigned] [usertype] <noident>
drivers/net/qlge/qlge_ethtool.c:84:8: warning: cast to restricted type
drivers/net/qlge/qlge_ethtool.c:83:21: warning: incorrect type in assignment (different base types)
drivers/net/qlge/qlge_ethtool.c:83:21:    expected restricted unsigned short [usertype] pkt_delay
drivers/net/qlge/qlge_ethtool.c:83:21:    got unsigned short [unsigned] [usertype] <noident>

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

myri10ge: print MAC and serial number on probe failure

To help board identification and diagnosis, print the MAC
and serial number on probe failure if they are available.

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

pkt_sched: cls_u32: Fix locking in u32_change()

New nodes are inserted in u32_change() under rtnl_lock() with wmb(),
so without tcf_tree_lock() like in other classifiers (e.g. cls_fw).
This isn't enough without rmb() on the read side, but on the other
hand adding such barriers doesn't give any savings, so the lock is
added instead.

Reported-by: m0sia <m0sia@plotinka.ru>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sparc: Fix minor SPARC32 compile error

When CONFIG_PROC_FS is unset, include/linux/interrupt.h defines
init_irq_proc() as an empty function.

arch/sparc/kernel/irq_32.c defines this function unconditionally.

Fix the latter so that it only defines this function when CONFIG_PROC_FS
is set.

This fixes the following error:
arch/sparc/kernel/irq_32.c:672: error: redefinition of 'init_irq_proc'
include/linux/interrupt.h:461: error: previous definition of
'init_irq_proc' was here

This was found using randconfig builds.

Signed-off-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

iucv: fix cpu hotplug

If the iucv module is compiled in/loaded but no user is registered cpu
hot remove doesn't work. Reason for that is that the iucv cpu hotplug
notifier on CPU_DOWN_PREPARE checks if the iucv_buffer_cpumask would
be empty after the corresponding bit would be cleared. However the bit
was never set since iucv wasn't enable. That causes all cpu hot unplug
operations to fail in this scenario.
To fix this use iucv_path_table as an indicator wether iucv is enabled
or not.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af_iucv: Free iucv path/socket in path_pending callback

Free iucv path after iucv_path_sever() calls in iucv_callback_connreq()
(path_pending() iucv callback).
If iucv_path_accept() fails, free path and free/kill newly created socket.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af_iucv: avoid left over IUCV connections from failing connects

For certain types of AFIUCV socket connect failures IUCV connections
are left over. Add some cleanup-statements to avoid cluttered IUCV
connections.

Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af_iucv: New error return codes for connect()

If the iucv_path_connect() call fails then return an error code that
corresponds to the iucv_path_connect() failure condition; instead of
returning -ECONNREFUSED for any failure.

This helps to improve error handling for user space applications
(e.g. inform the user that the z/VM guest is not authorized to
connect to other guest virtual machines).

The error return codes are based on those described in connect(2).

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mm: update my address

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

X86_DEBUGCTLMSR won't work on uml

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

uml got broken by commit 30742d5c2277c325fb0e9d2d817d55a19995fe8f

... if you revert a commit, revert the fixups elsewhere that had been
triggered by it. Such as 8c56250f48347750c82ab18d98d647dcf99ca674
(lockdep, UML: fix compilation when CONFIG_TRACE_IRQFLAGS_SUPPORT is not set).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

get rid of the last symlink in uml build

We need to make asm-offsets.h contents visible for objects built
with userland headers. Instead of creating a symlink, just have the
file with equivalent include (relative to location of header) created
once. That kills the last symlink used in arch/um builds.

Additionally, both generated headers can become dependencies of
archprepare now, killing the misuse of prepare.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

net/ehea: bitops work on unsigned longs

The flags field of struct ehea_port is only used with test_bit(),
clear_bit() and set_bit() and these interfaces only work on
"unsigned long"s, so change the field to be an "unsigned long". Also,
this field only has two bits defined for it (0 and 1) so will still be
fine if someone builds this driver for a 32 bit arch (at least as far as
this flags field is concerned).

Also note that ehea_driver_flags is only used in ehca_main.c, so make it
static in there.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "net: Fix for initial link state in 2.6.28"

This reverts commit 22604c866889c4b2e12b73cbf1683bda1b72a313.

We can't fix this issue in this way, because we now can try
to take the dev_base_lock rwlock as a writer in software interrupt
context and that is not allowed without major surgery elsewhere.

This initial link state problem needs to be solved in some other
way.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'topic/misc' into for-linus

Merge branch 'topic/asoc' into for-linus

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound-2.6 into topic/asoc

ALSA: ice1724 - Fix a typo in IEC958 PCM name

Fix trivial name string typo as reported in bug 2552.

Signed-off-by: Alan Horstmann <gineera@aspect135.co.uk>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

inotify: fix type errors in interfaces

The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes.

For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is
currently 'u32', it should be '__s32' .  That is Robert's suggestion, and
is consistent with the other declarations of watch descriptors in the
kernel source, in particular, the inotify_event structure in
include/linux/inotify.h:

struct inotify_event {
        __s32           wd;             /* watch descriptor */
        __u32           mask;           /* watch mask */
        __u32           cookie;         /* cookie to synchronize two events */
        __u32           len;            /* length (including nulls) of name */
        char            name[0];        /* stub for possible name */
};

The patch makes the changes needed for inotify_rm_watch().

Signed-off-by: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Robert Love <rlove@google.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fix breakage in reiserfs_new_inode()

now that we use ih.key earlier, we need to do all its setup early enough

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fix the treatment of jfs special inodes

We used to put them on a single list, without any locking. Racy.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

vfs: remove duplicate code in get_fs_type()

save 14 bytes:

   text    data     bss     dec     hex filename
   1354      32       4    1390     56e fs/filesystems.o.before
   text    data     bss     dec     hex filename
   1340      32       4    1376     560 fs/filesystems.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

add a vfs_fsync helper

Fsync currently has a fdatawrite/fdatawait pair around the method call,
and a mutex_lock/unlock of the inode mutex.  All callers of fsync have
to duplicate this, but we have a few and most of them don't quite get
it right.  This patch adds a new vfs_fsync that takes care of this.
It's a little more complicated as usual as ->fsync might get a NULL file
pointer and just a dentry from nfsd, but otherwise gets afile and we
want to take the mapping and file operations from it when it is there.

Notes on the fsync callers:

- ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the
    lower file
- coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host
file, and returning 0 when ->fsync was missing
- shm wasn't calling either filemap_fdatawrite / filemap_fdatawait nor
   taking i_mutex.  Now given that shared memory doesn't have disk
   backing not doing anything in fsync seems fine and I left it out of
   the vfs_fsync conversion for now, but in that case we might just
   not pass it through to the lower file at all but just call the no-op
   simple_sync_file directly.

[and now actually export vfs_fsync]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

sys_execve and sys_uselib do not call into fsnotify

sys_execve and sys_uselib do not call into fsnotify so inotify does not get
open events for these types of syscalls. This patch simply makes the
requisite fsnotify calls.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

zero i_uid/i_gid on inode allocation

... and don't bother in callers. Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.

i_mode is not worth the effort; it has no common default value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

inode->i_op is never NULL

We used to have rather schizophrenic set of checks for NULL ->i_op even
though it had been eliminated years ago. You'd need to go out of your
way to set it to NULL explicitly _and_ a bunch of code would die on
such inodes anyway. After killing two remaining places that still
did that bogosity, all that crap can go away.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ntfs: don't NULL i_op

it's already set to empty table (and no, ntfs doesn't have any explicit
checks for NULL ->i_op or NULL ->i_fop)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

isofs check for NULL ->i_op in root directory is dead code

for one thing it never happens, for another we check that inode
is a directory right after that place anyway (and we'd already
checked that reading it from disk has not failed).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

affs: do not zero ->i_op

it is already set to empty table and should never be NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

kill suid bit only for regular files

We don't have to do it because it is useless for non regular files.
In fact block device may trigger this path without dentry->d_inode->i_mutex.

(akpm: concerns were expressed (by me) about S_ISDIR inodes)

Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

vfs: lseek(fd, 0, SEEK_CUR) race condition

This patch fixes a race condition in lseek. While it is expected that
unpredictable behaviour may result while repositioning the offset of a
file descriptor concurrently with reading/writing to the same file
descriptor, this should not happen when merely *reading* the file
descriptor's offset.

Unfortunately, the only portable way in Unix to read a file
descriptor's offset is lseek(fd, 0, SEEK_CUR); however executing this
concurrently with read/write may mess up the position.

[with fixes from akpm]

Signed-off-by: Alain Knaff <alain@knaff.lu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ocfs2: Access the right buffer_head in ocfs2_merge_rec_left.

In commit "ocfs2: Use metadata-specific ocfs2_journal_access_*()
functions", the wrong buffer_head is accessed. So change it
to the right buffer_head.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2: use min_t in ocfs2_quota_read()

This is preferred to min().

Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2: remove unneeded lvb casts

dlmglue.c has lots of code which casts the return value of ocfs2_dlm_lvb().
This is pointless however, as ocfs2_dlm_lvb() returns void *.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2: Add xattr support checking in init_security

We must check whether ocfs2 volume support xattr in init_security,
if not support xattr and security is enable, would cause failure of mknod.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2: alloc xattr bucket in ocfs2_xattr_set_handle

In extreme situation, may need xattr bucket for setting
security entry and acl entries during mknod. This only
happens when block size is too small.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2: calculate and reserve credits for xattr value in mknod

We extend the credits for xattr's large value in set_value_outside
before, this can give rise to a credits issue when we set one security
entry and two acl entries duing mknod. As we remove extend_trans form
set_value_outside, we must calculate and reserve the credits for
xattr's large value in mknod.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2/xattr: fix credits calculation during index create

When creating a xattr index block, the old calculation forget
to add credits for the meta change of the alloc file. So add
more credits and more comments to explain it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2/xattr: Always updating ctime during xattr set.

In xattr set, we should always update ctime if the operation goes
sucessfully. The old one mistakenly put it in ocfs2_xattr_set_entry
which is only called when we set xattr in inode or xattr block. The
side benefit is that it resolve the bug 1052 since in that scenario,
ocfs2_calc_xattr_set_need only calc out the xattr set credits while
ocfs2_xattr_set_entry update the inode also which isn't concerned with
the process of xattr set.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2/xattr: Remove extend_trans call and add its credits from the beginning

Actually, when setting a new xattr value, we know it from the very
beginning, and it isn't like the extension of bucket in which case
we can't figure it out. So remove ocfs2_extend_trans in that function
and calculate it before the transaction. It also relieve acl operation
from the worry about the side effect of ocfs2_extend_trans.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>

ocfs2/dlm: Fix race during lockres mastery

dlm_get_lock_resource() is supposed to return a lock resource with a proper
master. If multiple concurrent threads attempt to lookup the lockres for the
same lockid while the lock mastery in underway, one or more threads are likely
to return a lockres without a proper master.

This patch makes the threads wait in dlm_get_lock_resource() while the mastery
is underway, ensuring all threads return the lockres with a proper master.

This issue is known to be limited to users using the flock() syscall. For all
other fs operations, the ocfs2 dlmglue layer serializes the dlm op for each
lockid.

Users encountering this bug will see flock() return EINVAL and dmesg have the
following error:
ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource <LOCKID>: bad api args

Reported-by: Coly Li <coyli@suse.de>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>