Add ANDROID_OEM_DATA for implement of oem gki
Bug: 188749221
Bug: 233781098
Change-Id: I96b1c690fda172d0c490e944557a674a37620742
Signed-off-by: Yang Yang <yang.yang@vivo.com>
(cherry picked from commit 3a0675c6ca)
Add ANDROID_OEM_DATA to block_device_operations which allows a new
vendor specific function call.
Bug: 193106408
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Change-Id: I472f1cc25698c841841822908c4827545b8593df
Add sysfs files that expose the inline encryption capabilities of
request queues:
/sys/block/$disk/queue/crypto/max_dun_bits
/sys/block/$disk/queue/crypto/modes/$mode
/sys/block/$disk/queue/crypto/num_keyslots
Userspace can use these new files to decide what encryption settings to
use, or whether to use inline encryption at all. This also brings the
crypto capabilities in line with the other queue properties, which are
already discoverable via the queue directory in sysfs.
Design notes:
- Place the new files in a new subdirectory "crypto" to group them
together and to avoid complicating the main "queue" directory. This
also makes it possible to replace "crypto" with a symlink later if
we ever make the blk_crypto_profiles into real kobjects (see below).
- It was necessary to define a new kobject that corresponds to the
crypto subdirectory. For now, this kobject just contains a pointer
to the blk_crypto_profile. Note that multiple queues (and hence
multiple such kobjects) may refer to the same blk_crypto_profile.
An alternative design would more closely match the current kernel
data structures: the blk_crypto_profile could be a kobject itself,
located directly under the host controller device's kobject, while
/sys/block/$disk/queue/crypto would be a symlink to it.
I decided not to do that for now because it would require a lot more
changes, such as no longer embedding blk_crypto_profile in other
structures, and also because I'm not sure we can rule out moving the
crypto capabilities into 'struct queue_limits' in the future. (Even
if multiple queues share the same crypto engine, maybe the supported
data unit sizes could differ due to other queue properties.) It
would also still be possible to switch to that design later without
breaking userspace, by replacing the directory with a symlink.
- Use "max_dun_bits" instead of "max_dun_bytes". Currently, the
kernel internally stores this value in bytes, but that's an
implementation detail. It probably makes more sense to talk about
this value in bits, and choosing bits is more future-proof.
- "modes" is a sub-subdirectory, since there may be multiple supported
crypto modes, sysfs is supposed to have one value per file, and it
makes sense to group all the mode files together.
- Each mode had to be named. The crypto API names like "xts(aes)" are
not appropriate because they don't specify the key size. Therefore,
I assigned new names. The exact names chosen are arbitrary, but
they happen to match the names used in log messages in fs/crypto/.
- The "num_keyslots" file is a bit different from the others in that
it is only useful to know for performance reasons. However, it's
included as it can still be useful. For example, a user might not
want to use inline encryption if there aren't very many keyslots.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220124215938.2769-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 20f01f163203666010ee1560852590a0c0572726)
Conflicts:
Documentation/ABI/stable/sysfs-block
block/blk-sysfs.c
(dropped the documentation part)
Bug: 207390665
Change-Id: I959191599595aff62e5c0ca180365b2f589e0d6a
Signed-off-by: Eric Biggers <ebiggers@google.com>
This function is trivial and is only used in one place. Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124013733.347612-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 72cd9df2ef788d88c138d51223a01ca6281f232d)
Change-Id: Icf215db41f6b1cdc377f925b8150a47d62db18b8
blk_keyslot_manager is misnamed because it doesn't necessarily manage
keyslots. It actually does several different things:
- Contains the crypto capabilities of the device.
- Provides functions to control the inline encryption hardware.
Originally these were just for programming/evicting keyslots;
however, new functionality (hardware-wrapped keys) will require new
functions here which are unrelated to keyslots. Moreover,
device-mapper devices already (ab)use "keyslot_evict" to pass key
eviction requests to their underlying devices even though
device-mapper devices don't have any keyslots themselves (so it
really should be "evict_key", not "keyslot_evict").
- Sometimes (but not always!) it manages keyslots. Originally it
always did, but device-mapper devices don't have keyslots
themselves, so they use a "passthrough keyslot manager" which
doesn't actually manage keyslots. This hack works, but the
terminology is unnatural. Also, some hardware doesn't have keyslots
and thus also uses a "passthrough keyslot manager" (support for such
hardware is yet to be upstreamed, but it will happen eventually).
Let's stop having keyslot managers which don't actually manage keyslots.
Instead, rename blk_keyslot_manager to blk_crypto_profile.
This is a fairly big change, since for consistency it also has to update
keyslot manager-related function names, variable names, and comments --
not just the actual struct name. However it's still a fairly
straightforward change, as it doesn't change any actual functionality.
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit cb77cb5abe1f4fae4a33b735606aae22f9eaa1c7)
Conflicts:
block/blk-crypto.c
drivers/scsi/ufs/ufshcd-crypto.c
include/linux/blk-mq.h
Bug: 160883801
Bug: 162257402
Bug: 207390665
Bug: 234653003
Change-Id: I787cdc0d74baf5e4c94d73d5c467122bcc472649
Signed-off-by: Eric Biggers <ebiggers@google.com>
Changes in 5.15.36
fs: remove __sync_filesystem
block: remove __sync_blockdev
block: simplify the block device syncing code
vfs: make sync_filesystem return errors from ->sync_fs
xfs: return errors in xfs_fs_sync_fs
dma-mapping: remove bogus test for pfn_valid from dma_map_resource
arm64/mm: drop HAVE_ARCH_PFN_VALID
etherdevice: Adjust ether_addr* prototypes to silence -Wstringop-overead
mm: page_alloc: fix building error on -Werror=array-compare
perf tools: Fix segfault accessing sample_id xyarray
mm, kfence: support kmem_dump_obj() for KFENCE objects
gfs2: assign rgrp glock before compute_bitstructs
scsi: ufs: core: scsi_get_lba() error fix
net/sched: cls_u32: fix netns refcount changes in u32_change()
ALSA: usb-audio: Clear MIDI port active flag after draining
ALSA: hda/realtek: Add quirk for Clevo NP70PNP
ASoC: atmel: Remove system clock tree configuration for at91sam9g20ek
ASoC: topology: Correct error handling in soc_tplg_dapm_widget_create()
ASoC: rk817: Use devm_clk_get() in rk817_platform_probe
ASoC: msm8916-wcd-digital: Check failure for devm_snd_soc_register_component
ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use
dmaengine: idxd: fix device cleanup on disable
dmaengine: imx-sdma: Fix error checking in sdma_event_remap
dmaengine: mediatek:Fix PM usage reference leak of mtk_uart_apdma_alloc_chan_resources
dmaengine: dw-edma: Fix unaligned 64bit access
spi: spi-mtk-nor: initialize spi controller after resume
esp: limit skb_page_frag_refill use to a single page
spi: cadence-quadspi: fix incorrect supports_op() return value
igc: Fix infinite loop in release_swfw_sync
igc: Fix BUG: scheduling while atomic
igc: Fix suspending when PTM is active
ALSA: hda/hdmi: fix warning about PCM count when used with SOF
rxrpc: Restore removed timer deletion
net/smc: Fix sock leak when release after smc_shutdown()
net/packet: fix packet_sock xmit return value checking
ip6_gre: Avoid updating tunnel->tun_hlen in __gre6_xmit()
ip6_gre: Fix skb_under_panic in __gre6_xmit()
net: restore alpha order to Ethernet devices in config
net/sched: cls_u32: fix possible leak in u32_init_knode()
l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu
ipv6: make ip6_rt_gc_expire an atomic_t
can: isotp: stop timeout monitoring when no first frame was sent
net: dsa: hellcreek: Calculate checksums in tagger
net: mscc: ocelot: fix broken IP multicast flooding
netlink: reset network and mac headers in netlink_dump()
drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails
net: stmmac: Use readl_poll_timeout_atomic() in atomic state
dmaengine: idxd: add RO check for wq max_batch_size write
dmaengine: idxd: add RO check for wq max_transfer_size write
dmaengine: idxd: skip clearing device context when device is read-only
selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
arm64: mm: fix p?d_leaf()
ARM: vexpress/spc: Avoid negative array index when !SMP
reset: renesas: Check return value of reset_control_deassert()
reset: tegra-bpmp: Restore Handle errors in BPMP response
platform/x86: samsung-laptop: Fix an unsigned comparison which can never be negative
ALSA: usb-audio: Fix undefined behavior due to shift overflowing the constant
drm/msm/disp: check the return value of kzalloc()
arm64: dts: imx: Fix imx8*-var-som touchscreen property sizes
vxlan: fix error return code in vxlan_fdb_append
cifs: Check the IOCB_DIRECT flag, not O_DIRECT
net: atlantic: Avoid out-of-bounds indexing
mt76: Fix undefined behavior due to shift overflowing the constant
brcmfmac: sdio: Fix undefined behavior due to shift overflowing the constant
dpaa_eth: Fix missing of_node_put in dpaa_get_ts_info()
drm/msm/mdp5: check the return of kzalloc()
net: macb: Restart tx only if queue pointer is lagging
scsi: iscsi: Release endpoint ID when its freed
scsi: iscsi: Merge suspend fields
scsi: iscsi: Fix NOP handling during conn recovery
scsi: qedi: Fix failed disconnect handling
stat: fix inconsistency between struct stat and struct compat_stat
VFS: filename_create(): fix incorrect intent.
nvme: add a quirk to disable namespace identifiers
nvme-pci: disable namespace identifiers for the MAXIO MAP1002/1202
nvme-pci: disable namespace identifiers for Qemu controllers
EDAC/synopsys: Read the error count from the correct register
mm/memory-failure.c: skip huge_zero_page in memory_failure()
memcg: sync flush only if periodic flush is delayed
mm, hugetlb: allow for "high" userspace addresses
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
ata: pata_marvell: Check the 'bmdma_addr' beforing reading
dma: at_xdmac: fix a missing check on list iterator
dmaengine: imx-sdma: fix init of uart scripts
net: atlantic: invert deep par in pm functions, preventing null derefs
Input: omap4-keypad - fix pm_runtime_get_sync() error checking
scsi: sr: Do not leak information in ioctl
sched/pelt: Fix attach_entity_load_avg() corner case
perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled
drm/panel/raspberrypi-touchscreen: Avoid NULL deref if not initialised
drm/panel/raspberrypi-touchscreen: Initialise the bridge in prepare
KVM: PPC: Fix TCE handling for VFIO
drm/vc4: Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage
powerpc/perf: Fix power9 event alternatives
powerpc/perf: Fix power10 event alternatives
perf script: Always allow field 'data_src' for auxtrace
perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event
xtensa: patch_text: Fixup last cpu should be master
xtensa: fix a7 clobbering in coprocessor context load/store
openvswitch: fix OOB access in reserve_sfa_size()
gpio: Request interrupts after IRQ is initialized
ASoC: soc-dapm: fix two incorrect uses of list iterator
e1000e: Fix possible overflow in LTR decoding
ARC: entry: fix syscall_trace_exit argument
arm_pmu: Validate single/group leader events
KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs
netfilter: conntrack: convert to refcount_t api
netfilter: conntrack: avoid useless indirection during conntrack destruction
ext4: fix fallocate to use file_modified to update permissions consistently
ext4: fix symlink file size not match to file content
ext4: fix use-after-free in ext4_search_dir
ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
ext4, doc: fix incorrect h_reserved size
ext4: fix overhead calculation to account for the reserved gdt blocks
ext4: force overhead calculation if the s_overhead_cluster makes no sense
netfilter: nft_ct: fix use after free when attaching zone template
jbd2: fix a potential race while discarding reserved buffers after an abort
spi: atmel-quadspi: Fix the buswidth adjustment between spi-mem and controller
block/compat_ioctl: fix range check in BLKGETSIZE
arm64: dts: qcom: add IPA qcom,qmp property
Linux 5.15.36
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iba23a60bda8fd07f26fd7f9217f208c2e6ee26c2
There are a lot of different structures that need to have a "frozen" abi
for the next 5+ years. Add padding to a lot of them in order to be able
to handle any future changes that might be needed due to LTS and
security fixes that might come up.
It's a best guess, based on what has happened in the past from the
5.10.0..5.10.110 release (1 1/2 years). Yes, past changes do not mean
that future changes will also be needed in the same area, but that is a
hint that those areas are both well maintained and looked after, and
there have been previous problems found in them.
Also the list of structures that are being required based on OEM usage
in the android/ symbol lists were consulted as that's a larger list than
what has been changed in the past.
Hopefully we caught everything we need to worry about, only time will
tell...
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I880bbcda0628a7459988eeb49d18655522697664
commit 7a5428dcb7902700b830e912feee4e845df7c019 upstream.
Various block drivers call blk_set_queue_dying to mark a disk as dead due
to surprise removal events, but since commit 8e141f9eb8 that doesn't
work given that the GD_DEAD flag needs to be set to stop I/O.
Replace the driver calls to blk_set_queue_dying with a new (and properly
documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
only remaining caller.
Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e45c47d1f94e0cc7b6b079fdb4bcce2995e2adc4 upstream.
bio_start_io_acct_time() interface is like bio_start_io_acct() that
allows start_time to be passed in. This gives drivers the ability to
defer starting accounting until after IO is issued (but possibily not
entirely due to bio splitting).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ba0ffdd8ce48ad7f7e85191cd29f9674caca3745 ]
Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive. But can be noticed even on
native hardware.
Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.
While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (ufs, qla2xxx,
target, smartpqi, lpfc, mpt3sas).
The core change causing the most churn was replacing the command
request field request with a macro, allowing us to offset map to it
and remove the redundant field; the same was also done for the tag
field.
The most impactful change is the final removal of scsi_ioctl, which
has been deprecated for over a decade"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
scsi: ufs: ufs-exynos: Fix static checker warning
scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
scsi: lpfc: Copyright updates for 14.0.0.1 patches
scsi: lpfc: Update lpfc version to 14.0.0.1
scsi: lpfc: Add bsg support for retrieving adapter cmf data
scsi: lpfc: Add cmf_info sysfs entry
scsi: lpfc: Add debugfs support for cm framework buffers
scsi: lpfc: Add support for maintaining the cm statistics buffer
scsi: lpfc: Add rx monitoring statistics
scsi: lpfc: Add support for the CM framework
scsi: lpfc: Add cmfsync WQE support
scsi: lpfc: Add support for cm enablement buffer
scsi: lpfc: Add cm statistics buffer support
scsi: lpfc: Add EDC ELS support
scsi: lpfc: Expand FPIN and RDF receive logging
scsi: lpfc: Add MIB feature enablement support
scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
scsi: fc: Add EDC ELS definition
...
When merging one bio to request, if they are discard IO and the queue
supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE
because both block core and related drivers(nvme, virtio-blk) doesn't
handle mixed discard io merge(traditional IO merge together with
discard merge) well.
Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation,
so both blk-mq and drivers just need to handle multi-range discard.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Fixes: 2705dfb209 ("block: fix discard request merge")
Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the sg_timeout and sg_reserved_size fields into the bsg_device and
scsi_device structures as they have nothing to do with generic block I/O.
Note that these values are now separate for bsg vs. SCSI device node
access, but that just matches how /dev/sg vs the other nodes has always
behaved.
Link: https://lore.kernel.org/r/20210729064845.1044147-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull block fixes from Jens Axboe:
- NVMe pull request (Christoph):
- tracing fix (Keith Busch)
- fix multipath head refcounting (Hannes Reinecke)
- Write Zeroes vs PI fix (me)
- drop a bogus WARN_ON (Zhihao Cheng)
- Increase max blk-cgroup policy size, now that mq-deadline
uses it too (Oleksandr)
* tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
nvme: set the PRACT bit when using Write Zeroes with T10 PI
nvme: fix nvme_setup_command metadata trace event
nvme: fix refcounting imbalance when all paths are down
nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
block: increase BLKCG_MAX_POLS
Pull more block updates from Jens Axboe:
"A combination of changes that ended up depending on both the driver
and core branch (and/or the IDE removal), and a few late arriving
fixes. In detail:
- Fix io ticks wrap-around issue (Chunguang)
- nvme-tcp sock locking fix (Maurizio)
- s390-dasd fixes (Kees, Christoph)
- blk_execute_rq polling support (Keith)
- blk-cgroup RCU iteration fix (Yu)
- nbd backend ID addition (Prasanna)
- Partition deletion fix (Yufen)
- Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)
- Removal of now dead block request types due to IDE removal
(Christoph)
- Loop probing and control device cleanups (Christoph)
- Device uevent fix (Christoph)
- Misc cleanups/fixes (Tetsuo, Christoph)"
* tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
block: fix the problem of io_ticks becoming smaller
nvme-tcp: can't set sk_user_data without write_lock
loop: remove unused variable in loop_set_status()
block: remove the bdgrab in blk_drop_partitions
block: grab a device refcount in disk_uevent
s390/dasd: Avoid field over-reading memcpy()
dasd: unexport dasd_set_target_state
block: check disk exist before trying to add partition
ubd: remove dead code in ubd_setup_common
nvme: use return value from blk_execute_rq()
block: return errors from blk_execute_rq()
nvme: use blk_execute_rq() for passthrough commands
block: support polling through blk_execute_rq
block: remove REQ_OP_SCSI_{IN,OUT}
block: mark blk_mq_init_queue_data static
loop: rewrite loop_exit using idr_for_each_entry
loop: split loop_lookup
loop: don't allow deleting an unspecified loop device
loop: move loop_ctl_mutex locking into loop_add
...
Pull device mapper updates from Mike Snitzer:
- Various DM persistent-data library improvements and fixes that
benefit both the DM thinp and cache targets.
- A few small DM kcopyd efficiency improvements.
- Significant zoned related block core, DM core and DM zoned target
changes that culminate with adding zoned append emulation (which is
required to properly fix DM crypt's zoned support).
- Various DM writecache target changes that improve efficiency. Adds an
optional "metadata_only" feature that only promotes bios flagged with
REQ_META. But the most significant improvement is writecache's
ability to pause writeback, for a confiurable time, if/when the
working set is larger than the cache (and the cache is full) -- this
ensures performance is no worse than the slower origin device.
* tag 'for-5.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
dm writecache: make writeback pause configurable
dm writecache: pause writeback if cache full and origin being written directly
dm io tracker: factor out IO tracker
dm btree remove: assign new_root only when removal succeeds
dm zone: fix dm_revalidate_zones() memory allocation
dm ps io affinity: remove redundant continue statement
dm writecache: add optional "metadata_only" parameter
dm writecache: add "cleaner" and "max_age" to Documentation
dm writecache: write at least 4k when committing
dm writecache: flush origin device when writing and cache is full
dm writecache: have ssd writeback wait if the kcopyd workqueue is busy
dm writecache: use list_move instead of list_del/list_add in writecache_writeback()
dm writecache: commit just one block, not a full page
dm writecache: remove unused gfp_t argument from wc_add_block()
dm crypt: Fix zoned block device support
dm: introduce zone append emulation
dm: rearrange core declarations for extended use from dm-zone.c
block: introduce BIO_ZONE_WRITE_LOCKED bio flag
block: introduce bio zone helpers
block: improve handling of all zones reset operation
...
With the legacy IDE driver gone drivers now use either REQ_OP_DRV_*
or REQ_OP_SCSI_*, so unify the two concepts of passthrough requests
into a single one.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the helper functions bio_zone_no() and bio_zone_is_seq().
Both are the BIO counterparts of the request helpers blk_rq_zone_no()
and blk_rq_zone_is_seq(), respectively returning the number of the
target zone of a bio and true if the BIO target zone is sequential.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
- Fix for shared tag set exit (Bart)
- Correct ioctl range for zoned ioctls (Damien)
- Removed dead/unused function (Lin)
- Fix perf regression for shared tags (Ming)
- Fix out-of-bounds issue with kyber and preemption (Omar)
- BFQ merge fix (Paolo)
- Two error handling fixes for nbd (Sun)
- Fix weight update in blk-iocost (Tejun)
- NVMe pull request (Christoph):
- correct the check for using the inline bio in nvmet (Chaitanya
Kulkarni)
- demote unsupported command warnings (Chaitanya Kulkarni)
- fix corruption due to double initializing ANA state (me, Hou Pu)
- reset ns->file when open fails (Daniel Wagner)
- fix a NULL deref when SEND is completed with error in nvmet-rdma
(Michal Kalderon)
- Fix kernel-doc warning (Bart)
* tag 'block-5.13-2021-05-14' of git://git.kernel.dk/linux-block:
block/partitions/efi.c: Fix the efi_partition() kernel-doc header
blk-mq: Swap two calls in blk_mq_exit_queue()
blk-mq: plug request for shared sbitmap
nvmet: use new ana_log_size instead the old one
nvmet: seset ns->file when open fails
nbd: share nbd_put and return by goto put_nbd
nbd: Fix NULL pointer in flush_workqueue
blkdev.h: remove unused codes blk_account_rq
block, bfq: avoid circular stable merges
blk-iocost: fix weight updates of inner active iocgs
nvmet: demote fabrics cmd parse err msg to debug
nvmet: use helper to remove the duplicate code
nvmet: demote discovery cmd parse err msg to debug
nvmet-rdma: Fix NULL deref when SEND is completed with error
nvmet: fix inline bio check for passthru
nvmet: fix inline bio check for bdev-ns
nvme-multipath: fix double initialization of ANA state
kyber: fix out of bounds access when preempted
block: uapi: fix comment about block device ioctl
Pull block fix from Jens Axboe:
"Turns out the bio max size change still has issues, so let's get it
reverted for 5.13-rc1. We'll shake out the issues there and defer it
to 5.14 instead"
* tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-block:
Revert "bio: limit bio max size"
Pull block fixes from Jens Axboe:
- dasd spelling fixes (Bhaskar)
- Limit bio max size on multi-page bvecs to the hardware limit, to
avoid overly large bio's (and hence latencies). Originally queued for
the merge window, but needed a fix and was dropped from the initial
pull (Changheun)
- NVMe pull request (Christoph):
- reset the bdev to ns head when failover (Daniel Wagner)
- remove unsupported command noise (Keith Busch)
- misc passthrough improvements (Kanchan Joshi)
- fix controller ioctl through ns_head (Minwoo Im)
- fix controller timeouts during reset (Tao Chiu)
- rnbd fixes/cleanups (Gioh, Md, Dima)
- Fix iov_iter re-expansion (yangerkun)
* tag 'block-5.13-2021-05-07' of git://git.kernel.dk/linux-block:
block: reexpand iov_iter after read/write
nvmet: remove unsupported command noise
nvme-multipath: reset bdev to ns head when failover
nvme-pci: fix controller reset hang when racing with nvme_timeout
nvme: move the fabrics queue ready check routines to core
nvme: avoid memset for passthrough requests
nvme: add nvme_get_ns helper
nvme: fix controller ioctl through ns_head
bio: limit bio max size
RDMA/rtrs: fix uninitialized symbol 'cnt'
s390: dasd: Mundane spelling fixes
block/rnbd: Remove all likely and unlikely
block/rnbd-clt: Check the return value of the function rtrs_clt_query
block/rnbd: Fix style issues
block/rnbd-clt: Change queue_depth type in rnbd_clt_session to size_t
bio size can grow up to 4GB when muli-page bvec is enabled.
but sometimes it would lead to inefficient behaviors.
in case of large chunk direct I/O, - 32MB chunk read in user space -
all pages for 32MB would be merged to a bio structure if the pages
physical addresses are contiguous. it makes some delay to submit
until merge complete. bio max size should be limited to a proper size.
When 32MB chunk read with direct I/O option is coming from userspace,
kernel behavior is below now in do_direct_IO() loop. it's timeline.
| bio merge for 32MB. total 8,192 pages are merged.
| total elapsed time is over 2ms.
|------------------ ... ----------------------->|
| 8,192 pages merged a bio.
| at this time, first bio submit is done.
| 1 bio is split to 32 read request and issue.
|--------------->
|--------------->
|--------------->
......
|--------------->
|--------------->|
total 19ms elapsed to complete 32MB read done from device. |
If bio max size is limited with 1MB, behavior is changed below.
| bio merge for 1MB. 256 pages are merged for each bio.
| total 32 bio will be made.
| total elapsed time is over 2ms. it's same.
| but, first bio submit timing is fast. about 100us.
|--->|--->|--->|---> ... -->|--->|--->|--->|--->|
| 256 pages merged a bio.
| at this time, first bio submit is done.
| and 1 read request is issued for 1 bio.
|--------------->
|--------------->
|--------------->
......
|--------------->
|--------------->|
total 17ms elapsed to complete 32MB read done from device. |
As a result, read request issue timing is faster if bio max size is limited.
Current kernel behavior with multipage bvec, super large bio can be created.
And it lead to delay first I/O request issue.
Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>