kernel_arpi

Author	SHA1	Message	Date
Greg Kroah-Hartman	53b9968c9b	Merge 5.4.70 into android11-5.4-lts Changes in 5.4.70 btrfs: fix filesystem corruption after a device replace mmc: sdhci: Workaround broken command queuing on Intel GLK based IRBIS models USB: gadget: f_ncm: Fix NDP16 datagram validation gpio: siox: explicitly support only threaded irqs gpio: mockup: fix resource leak in error path gpio: tc35894: fix up tc35894 interrupt configuration clk: socfpga: stratix10: fix the divider for the emac_ptp_free_clk vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock() net: virtio_vsock: Enhance connection semantics xfs: trim IO to found COW extent limit Input: i8042 - add nopnp quirk for Acer Aspire 5 A515 iio: adc: qcom-spmi-adc5: fix driver name ftrace: Move RCU is watching check after recursion check memstick: Skip allocating card when removing host drm/amdgpu: restore proper ref count in amdgpu_display_crtc_set_config clocksource/drivers/timer-gx6605s: Fixup counter reload libbpf: Remove arch-specific include path in Makefile drivers/net/wan/hdlc_fr: Add needed_headroom for PVC devices drm/sun4i: mixer: Extend regmap max_register net: dec: de2104x: Increase receive ring size for Tulip rndis_host: increase sleep time in the query-response loop nvme-core: get/put ctrl and transport module in nvme_dev_open/release() fuse: fix the ->direct_IO() treatment of iov_iter drivers/net/wan/lapbether: Make skb->protocol consistent with the header drivers/net/wan/hdlc: Set skb->protocol before transmitting mac80211: Fix radiotap header channel flag for 6GHz band mac80211: do not allow bigger VHT MPDUs than the hardware supports tracing: Make the space reserved for the pid wider tools/io_uring: fix compile breakage spi: fsl-espi: Only process interrupts for expected events nvme-pci: fix NULL req in completion handler nvme-fc: fail new connections to a deleted host or remote port gpio: sprd: Clear interrupt when setting the type as edge phy: ti: am654: Fix a leak in serdes_am654_probe() pinctrl: mvebu: Fix i2c sda definition for 98DX3236 nfs: Fix security label length not being reset clk: tegra: Always program PLL_E when enabled clk: samsung: exynos4: mark 'chipid' clock as CLK_IGNORE_UNUSED iommu/exynos: add missing put_device() call in exynos_iommu_of_xlate() gpio/aspeed-sgpio: enable access to all 80 input & output sgpios gpio/aspeed-sgpio: don't enable all interrupts by default gpio: aspeed: fix ast2600 bank properties i2c: cpm: Fix i2c_ram structure Input: trackpoint - enable Synaptics trackpoints scripts/dtc: only append to HOST_EXTRACFLAGS instead of overwriting random32: Restore __latent_entropy attribute on net_rand_state block/diskstats: more accurate approximation of io_ticks for slow disks mm: replace memmap_context by meminit_context mm: don't rely on system state to detect hot-plug operations nvme: Cleanup and rename nvme_block_nr() nvme: Introduce nvme_lba_to_sect() nvme: consolidate chunk_sectors settings epoll: do not insert into poll queues until all sanity checks are done epoll: replace ->visited/visited_list with generation count epoll: EPOLL_CTL_ADD: close the race in decision to take fast path ep_create_wakeup_source(): dentry name can change under you... netfilter: ctnetlink: add a range check for l3/l4 protonum Linux 5.4.70 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ic4aba5748f2156f83a29e1d0b637c31bbdc12370	2020-10-07 08:50:29 +02:00
Laurent Dufour	42b7153dd6	mm: replace memmap_context by meminit_context commit `c1d0da8335` upstream. Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-10-07 08:01:29 +02:00
Greg Kroah-Hartman	d6430e6763	Merge 5.4.61 into android11-5.4 Changes in 5.4.61 Documentation/llvm: add documentation on building w/ Clang/LLVM Documentation/llvm: fix the name of llvm-size net: wan: wanxl: use allow to pass CROSS_COMPILE_M68k for rebuilding firmware net: wan: wanxl: use $(M68KCC) instead of $(M68KAS) for rebuilding firmware x86/boot: kbuild: allow readelf executable to be specified kbuild: remove PYTHON2 variable kbuild: remove AS variable kbuild: replace AS=clang with LLVM_IAS=1 kbuild: support LLVM=1 to switch the default tools to Clang/LLVM drm/vgem: Replace opencoded version of drm_gem_dumb_map_offset() gfs2: Improve mmap write vs. punch_hole consistency gfs2: Never call gfs2_block_zero_range with an open transaction perf probe: Fix memory leakage when the probe point is not found khugepaged: khugepaged_test_exit() check mmget_still_valid() khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter() bcache: avoid nr_stripes overflow in bcache_device_init() btrfs: export helpers for subvolume name/id resolution btrfs: don't show full path of bind mounts in subvol= btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases btrfs: add wrapper for transaction abort predicate ALSA: hda/realtek: Add quirk for Samsung Galaxy Flex Book ALSA: hda/realtek: Add quirk for Samsung Galaxy Book Ion can: j1939: transport: j1939_session_tx_dat(): fix use-after-free read in j1939_tp_txtimer() can: j1939: socket: j1939_sk_bind(): make sure ml_priv is allocated spi: Prevent adding devices below an unregistering controller romfs: fix uninitialized memory leak in romfs_dev_read() kernel/relay.c: fix memleak on destroy relay channel uprobes: __replace_page() avoid BUG in munlock_vma_page() mm: include CMA pages in lowmem_reserve at boot mm, page_alloc: fix core hung in free_pcppages_bulk() RDMA/hfi1: Correct an interlock issue for TID RDMA WRITE request ext4: fix checking of directory entry validity for inline directories jbd2: add the missing unlock_buffer() in the error path of jbd2_write_superblock() scsi: zfcp: Fix use-after-free in request timeout handlers drm/amdgpu/display: use GFP_ATOMIC in dcn20_validate_bandwidth_internal drm/amd/display: Fix EDID parsing after resume from suspend drm/amd/display: fix pow() crashing when given base 0 kthread: Do not preempt current task if it is going to call schedule() opp: Enable resources again if they were disabled earlier scsi: ufs: Add DELAY_BEFORE_LPM quirk for Micron devices scsi: target: tcmu: Fix crash in tcmu_flush_dcache_range on ARM media: budget-core: Improve exception handling in budget_register() rtc: goldfish: Enable interrupt in set_alarm() when necessary media: vpss: clean up resources in init Input: psmouse - add a newline when printing 'proto' by sysfs MIPS: Fix unable to reserve memory for Crash kernel m68knommu: fix overwriting of bits in ColdFire V3 cache control svcrdma: Fix another Receive buffer leak xfs: fix inode quota reservation checks drm/ttm: fix offset in VMAs with a pg_offs in ttm_bo_vm_access jffs2: fix UAF problem ceph: fix use-after-free for fsc->mdsc swiotlb-xen: use vmalloc_to_page on vmalloc virt addresses cpufreq: intel_pstate: Fix cpuinfo_max_freq when MSR_TURBO_RATIO_LIMIT is 0 scsi: libfc: Free skb in fc_disc_gpn_id_resp() for valid cases virtio_ring: Avoid loop when vq is broken in virtqueue_poll media: camss: fix memory leaks on error handling paths in probe tools/testing/selftests/cgroup/cgroup_util.c: cg_read_strcmp: fix null pointer dereference xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init alpha: fix annotation of io{read,write}{16,32}be() fs/signalfd.c: fix inconsistent return codes for signalfd4 ext4: fix potential negative array index in do_split() ext4: don't allow overlapping system zones netfilter: nf_tables: nft_exthdr: the presence return value should be little-endian spi: stm32: fixes suspend/resume management ASoC: q6afe-dai: mark all widgets registers as SND_SOC_NOPM ASoC: q6routing: add dummy register read/write function bpf: sock_ops sk access may stomp registers when dst_reg = src_reg can: j1939: fix kernel-infoleak in j1939_sk_sock2sockaddr_can() can: j1939: transport: j1939_simple_recv(): ignore local J1939 messages send not by J1939 stack can: j1939: transport: add j1939_session_skb_find_by_offset() function i40e: Set RX_ONLY mode for unicast promiscuous on VLAN i40e: Fix crash during removing i40e driver net: fec: correct the error path for regulator disable in probe bonding: show saner speed for broadcast mode can: j1939: fix support for multipacket broadcast message can: j1939: cancel rxtimer on multipacket broadcast session complete can: j1939: abort multipacket broadcast session when timeout occurs can: j1939: add rxtimer for multipacket broadcast session bonding: fix a potential double-unregister s390/runtime_instrumentation: fix storage key handling s390/ptrace: fix storage key handling ASoC: msm8916-wcd-analog: fix register Interrupt offset ASoC: intel: Fix memleak in sst_media_open vfio/type1: Add proper error unwind for vfio_iommu_replay() kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode Revert "scsi: qla2xxx: Disable T10-DIF feature with FC-NVMe during probe" kconfig: qconf: do not limit the pop-up menu to the first row kconfig: qconf: fix signal connection to invalid slots efi: avoid error message when booting under Xen Fix build error when CONFIG_ACPI is not set/enabled: RDMA/bnxt_re: Do not add user qps to flushlist afs: Fix NULL deref in afs_dynroot_depopulate() ARM64: vdso32: Install vdso32 from vdso_install bonding: fix active-backup failover for current ARP slave net: ena: Prevent reset after device destruction net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe() hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit() net: dsa: b53: check for timeout powerpc/pseries: Do not initiate shutdown when system is running on UPS efi: add missed destroy_workqueue when efisubsys_init fails epoll: Keep a reference on files added to the check list do_epoll_ctl(): clean the failure exits up a bit mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible xen: don't reschedule in preemption off sections KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set Linux 5.4.61 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I316392535e50265bbcce6a56d3e10586d1ed29cc	2020-08-26 11:11:38 +02:00
Charan Teja Reddy	0cfb9320d0	mm, page_alloc: fix core hung in free_pcppages_bulk() commit `88e8ac11d2` upstream. The following race is observed with the repeated online, offline and a delay between two successive online of memory blocks of movable zone. P1 P2 Online the first memory block in the movable zone. The pcp struct values are initialized to default values,i.e., pcp->high = 0 & pcp->batch = 1. Allocate the pages from the movable zone. Try to Online the second memory block in the movable zone thus it entered the online_pages() but yet to call zone_pcp_update(). This process is entered into the exit path thus it tries to release the order-0 pages to pcp lists through free_unref_page_commit(). As pcp->high = 0, pcp->count = 1 proceed to call the function free_pcppages_bulk(). Update the pcp values thus the new pcp values are like, say, pcp->high = 378, pcp->batch = 63. Read the pcp's batch value using READ_ONCE() and pass the same to free_pcppages_bulk(), pcp values passed here are, batch = 63, count = 1. Since num of pages in the pcp lists are less than ->batch, then it will stuck in while(list_empty(list)) loop with interrupts disabled thus a core hung. Avoid this by ensuring free_pcppages_bulk() is called with proper count of pcp list pages. The mentioned race is some what easily reproducible without [1] because pcp's are not updated for the first memory block online and thus there is a enough race window for P2 between alloc+free and pcp struct values update through onlining of second memory block. With [1], the race still exists but it is very narrow as we update the pcp struct values for the first memory block online itself. This is not limited to the movable zone, it could also happen in cases with the normal zone (e.g., hotplug to a node that only has DMA memory, or no other memory yet). [1]: https://patchwork.kernel.org/patch/11696389/ Fixes: `5f8dcc2121` ("page-allocator: split per-cpu list into one-list-per-migrate-type") Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: <stable@vger.kernel.org> [2.6+] Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-08-26 10:40:51 +02:00
Doug Berger	5663159e29	mm: include CMA pages in lowmem_reserve at boot commit `e08d3fdfe2` upstream. The lowmem_reserve arrays provide a means of applying pressure against allocations from lower zones that were targeted at higher zones. Its values are a function of the number of pages managed by higher zones and are assigned by a call to the setup_per_zone_lowmem_reserve() function. The function is initially called at boot time by the function init_per_zone_wmark_min() and may be called later by accesses of the /proc/sys/vm/lowmem_reserve_ratio sysctl file. The function init_per_zone_wmark_min() was moved up from a module_init to a core_initcall to resolve a sequencing issue with khugepaged. Unfortunately this created a sequencing issue with CMA page accounting. The CMA pages are added to the managed page count of a zone when cma_init_reserved_areas() is called at boot also as a core_initcall. This makes it uncertain whether the CMA pages will be added to the managed page counts of their zones before or after the call to init_per_zone_wmark_min() as it becomes dependent on link order. With the current link order the pages are added to the managed count after the lowmem_reserve arrays are initialized at boot. This means the lowmem_reserve values at boot may be lower than the values used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the ratio values are unchanged. In many cases the difference is not significant, but for example an ARM platform with 1GB of memory and the following memory layout cma: Reserved 256 MiB at 0x0000000030000000 Zone ranges: DMA [mem 0x0000000000000000-0x000000002fffffff] Normal empty HighMem [mem 0x0000000030000000-0x000000003fffffff] would result in 0 lowmem_reserve for the DMA zone. This would allow userspace to deplete the DMA zone easily. Funnily enough $ cat /proc/sys/vm/lowmem_reserve_ratio would fix up the situation because as a side effect it forces setup_per_zone_lowmem_reserve. This commit breaks the link order dependency by invoking init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages have the chance to be properly accounted in their zone(s) and allowing the lowmem_reserve arrays to receive consistent values. Fixes: `bc22af74f2` ("mm: update min_free_kbytes from khugepaged after core initialization") Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jason Baron <jbaron@akamai.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-08-26 10:40:51 +02:00
Greg Kroah-Hartman	fa46997961	Merge 5.4.48 into android-5.4-stable Changes in 5.4.48 ACPI: GED: use correct trigger type field in _Exx / _Lxx handling drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4 ath10k: Fix the race condition in firmware dump work queue drm: bridge: adv7511: Extend list of audio sample rates media: staging: imgu: do not hold spinlock during freeing mmu page table media: imx: imx7-mipi-csis: Cleanup and fix subdev pad format handling crypto: ccp -- don't "select" CONFIG_DMADEVICES media: vicodec: Fix error codes in probe function media: si2157: Better check for running tuner in init objtool: Ignore empty alternatives spi: spi-mem: Fix Dual/Quad modes on Octal-capable devices drm/amdgpu: Init data to avoid oops while reading pp_num_states. arm64/kernel: Fix range on invalidating dcache for boot page tables libbpf: Fix memory leak and possible double-free in hashmap__clear spi: pxa2xx: Apply CS clk quirk to BXT x86,smap: Fix smap_{save,restore}() alternatives sched/fair: Refill bandwidth before scaling net: atlantic: make hw_get_regs optional net: ena: fix error returning in ena_com_get_hash_function() efi/libstub/x86: Work around LLVM ELF quirk build regression ath10k: remove the max_sched_scan_reqs value arm64: cacheflush: Fix KGDB trap detection media: staging: ipu3: Fix stale list entries on parameter queue failure rtw88: fix an issue about leak system resources spi: dw: Zero DMA Tx and Rx configurations on stack ACPICA: Dispatcher: add status checks block: alloc map and request for new hardware queue arm64: insn: Fix two bugs in encoding 32-bit logical immediates block: reset mapping if failed to update hardware queue count drm: rcar-du: Set primary plane zpos immutably at initializing lockdown: Allow unprivileged users to see lockdown status ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K platform/x86: dell-laptop: don't register micmute LED if there is no token MIPS: Loongson: Build ATI Radeon GPU driver as module Bluetooth: Add SCO fallback for invalid LMP parameters error kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb kgdb: Prevent infinite recursive entries to the debugger pmu/smmuv3: Clear IRQ affinity hint on device removal ACPI/IORT: Fix PMCG node single ID mapping handling mips: Fix cpu_has_mips64r1/2 activation for MIPS32 CPUs spi: dw: Enable interrupts in accordance with DMA xfer mode clocksource: dw_apb_timer: Make CPU-affiliation being optional clocksource: dw_apb_timer_of: Fix missing clockevent timers media: dvbdev: Fix tuner->demod media controller link btrfs: account for trans_block_rsv in may_commit_transaction btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums ARM: 8978/1: mm: make act_mm() respect THREAD_SIZE batman-adv: Revert "disable ethtool link speed detection when auto negotiation off" ice: Fix memory leak ice: Fix for memory leaks and modify ICE_FREE_CQ_BUFS mmc: meson-mx-sdio: trigger a soft reset after a timeout or CRC error Bluetooth: btmtkuart: Improve exception handling in btmtuart_probe() spi: dw: Fix Rx-only DMA transfers x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit net: vmxnet3: fix possible buffer overflow caused by bad DMA value in vmxnet3_get_rss() x86: fix vmap arguments in map_irq_stack staging: android: ion: use vmap instead of vm_map_ram ath10k: fix kernel null pointer dereference media: staging/intel-ipu3: Implement lock for stream on/off operations spi: Respect DataBitLength field of SpiSerialBusV2() ACPI resource brcmfmac: fix wrong location to get firmware feature regulator: qcom-rpmh: Fix typos in pm8150 and pm8150l tools api fs: Make xxx__mountpoint() more scalable e1000: Distribute switch variables for initialization dt-bindings: display: mediatek: control dpi pins mode to avoid leakage drm/mediatek: set dpi pin mode to gpio low to avoid leakage current audit: fix a net reference leak in audit_send_reply() media: dvb: return -EREMOTEIO on i2c transfer failure. media: platform: fcp: Set appropriate DMA parameters MIPS: Make sparse_init() using top-down allocation ath10k: add flush tx packets for SDIO chip Bluetooth: btbcm: Add 2 missing models to subver tables audit: fix a net reference leak in audit_list_rules_send() Drivers: hv: vmbus: Always handle the VMBus messages on CPU0 dpaa2-eth: fix return codes used in ndo_setup_tc netfilter: nft_nat: return EOPNOTSUPP if type or flags are not supported selftests/bpf: Fix memory leak in extract_build_id() net: bcmgenet: set Rx mode before starting netif net: bcmgenet: Fix WoL with password after deep sleep lib/mpi: Fix 64-bit MIPS build with Clang exit: Move preemption fixup up, move blocking operations down sched/core: Fix illegal RCU from offline CPUs drivers/perf: hisi: Fix typo in events attribute array iocost_monitor: drop string wrap around numbers when outputting json net: lpc-enet: fix error return code in lpc_mii_init() selinux: fix error return code in policydb_read() drivers: net: davinci_mdio: fix potential NULL dereference in davinci_mdio_probe() media: cec: silence shift wrapping warning in __cec_s_log_addrs() net: allwinner: Fix use correct return type for ndo_start_xmit() powerpc/spufs: fix copy_to_user while atomic libertas_tf: avoid a null dereference in pointer priv xfs: clean up the error handling in xfs_swap_extents Crypto/chcr: fix for ccm(aes) failed test MIPS: Truncate link address into 32bit for 32bit kernel mips: cm: Fix an invalid error code of INTVN__ERR kgdb: Fix spurious true from in_dbg_master() xfs: reset buffer write failure state on successful completion xfs: fix duplicate verification from xfs_qm_dqflush() platform/x86: intel-vbtn: Use acpi_evaluate_integer() platform/x86: intel-vbtn: Split keymap into buttons and switches parts platform/x86: intel-vbtn: Do not advertise switches to userspace if they are not there platform/x86: intel-vbtn: Also handle tablet-mode switch on "Detachable" and "Portable" chassis-types iwlwifi: avoid debug max amsdu config overwriting itself nvme: refine the Qemu Identify CNS quirk nvme-pci: align io queue count with allocted nvme_queue in nvme_probe nvme-tcp: use bh_lock in data_ready ath10k: Remove msdu from idr when management pkt send fails wcn36xx: Fix error handling path in 'wcn36xx_probe()' net: qed: Reduce RX and TX default ring count when running inside kdump kernel drm/mcde: dsi: Fix return value check in mcde_dsi_bind() mt76: avoid rx reorder buffer overflow md: don't flush workqueue unconditionally in md_open raid5: remove gfp flags from scribble_alloc() iocost: don't let vrate run wild while there's no saturation signal veth: Adjust hard_start offset on redirect XDP frames net/mlx5e: IPoIB, Drop multicast packets that this interface sent rtlwifi: Fix a double free in _rtl_usb_tx_urb_setup() mwifiex: Fix memory corruption in dump_station kgdboc: Use a platform device to handle tty drivers showing up late x86/boot: Correct relocation destination on old linkers sched: Defend cfs and rt bandwidth quota against overflow mips: MAAR: Use more precise address mask mips: Add udelay lpj numbers adjustment crypto: stm32/crc32 - fix ext4 chksum BUG_ON() crypto: stm32/crc32 - fix run-time self test issue. crypto: stm32/crc32 - fix multi-instance drm/amd/powerpay: Disable gfxoff when setting manual mode on picasso and raven drm/amdgpu: Sync with VM root BO when switching VM to CPU update mode selftests/bpf: CONFIG_IPV6_SEG6_BPF required for test_seg6_loop.o x86/mm: Stop printing BRK addresses MIPS: tools: Fix resource leak in elf-entry.c m68k: mac: Don't call via_flush_cache() on Mac IIfx btrfs: improve global reserve stealing logic btrfs: qgroup: mark qgroup inconsistent if we're inherting snapshot to a new qgroup macvlan: Skip loopback packets in RX handler PCI: Don't disable decoding when mmio_always_on is set MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe() bcache: fix refcount underflow in bcache_device_free() mmc: sdhci-msm: Set SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 quirk staging: greybus: sdio: Respect the cmd->busy_timeout from the mmc core mmc: via-sdmmc: Respect the cmd->busy_timeout from the mmc core ice: fix potential double free in probe unrolling ixgbe: fix signed-integer-overflow warning iwlwifi: mvm: fix aux station leak mmc: sdhci-esdhc-imx: fix the mask for tuning start point spi: dw: Return any value retrieved from the dma_transfer callback cpuidle: Fix three reference count leaks platform/x86: hp-wmi: Convert simple_strtoul() to kstrtou32() platform/x86: intel-hid: Add a quirk to support HP Spectre X2 (2015) platform/x86: intel-vbtn: Only blacklist SW_TABLET_MODE on the 9 / "Laptop" chasis-type platform/x86: asus_wmi: Reserve more space for struct bias_args libbpf: Fix perf_buffer__free() API for sparse allocs bpf: Fix map permissions check bpf: Refactor sockmap redirect code so its easy to reuse bpf: Fix running sk_skb program types with ktls selftests/bpf, flow_dissector: Close TAP device FD after the test kasan: stop tests being eliminated as dead code with FORTIFY_SOURCE string.h: fix incompatibility between FORTIFY_SOURCE and KASAN btrfs: free alien device after device add btrfs: include non-missing as a qualifier for the latest_bdev btrfs: send: emit file capabilities after chown btrfs: force chunk allocation if our global rsv is larger than metadata btrfs: fix error handling when submitting direct I/O bio btrfs: fix wrong file range cleanup after an error filling dealloc range btrfs: fix space_info bytes_may_use underflow after nocow buffered write btrfs: fix space_info bytes_may_use underflow during space cache writeout powerpc/mm: Fix conditions to perform MMU specific management by blocks on PPC32. mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked() mm: initialize deferred pages with interrupts enabled mm/pagealloc.c: call touch_nmi_watchdog() on max order boundaries in deferred init mm: call cond_resched() from deferred_init_memmap() ima: Fix ima digest hash table key calculation ima: Switch to ima_hash_algo for boot aggregate ima: Evaluate error in init_ima() ima: Directly assign the ima_default_policy pointer to ima_rules ima: Call ima_calc_boot_aggregate() in ima_eventdigest_init() ima: Remove __init annotation from ima_pcrread() evm: Fix possible memory leak in evm_calc_hmac_or_hash() ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max ext4: fix error pointer dereference ext4: fix race between ext4_sync_parent() and rename() PCI: Avoid Pericom USB controller OHCI/EHCI PME# defect PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0 PCI: Avoid FLR for AMD Starship USB 3.0 PCI: Add ACS quirk for Intel Root Complex Integrated Endpoints PCI: vmd: Add device id for VMD device 8086:9A0B x86/amd_nb: Add Family 19h PCI IDs PCI: Add Loongson vendor ID serial: 8250_pci: Move Pericom IDs to pci_ids.h x86/amd_nb: Add AMD family 17h model 60h PCI IDs ima: Remove redundant policy rule set in add_rules() ima: Set again build_ima_appraise variable PCI: Program MPS for RCiEP devices e1000e: Disable TSO for buffer overrun workaround e1000e: Relax condition to trigger reset for ME workaround carl9170: remove P2P_GO support media: go7007: fix a miss of snd_card_free media: cedrus: Program output format during each run serial: 8250: Avoid error message on reprobe Bluetooth: hci_bcm: fix freeing not-requested IRQ b43legacy: Fix case where channel status is corrupted b43: Fix connection problem with WPA3 b43_legacy: Fix connection problem with WPA3 media: ov5640: fix use of destroyed mutex clk: mediatek: assign the initial value to clk_init_data of mtk_mux igb: Report speed and duplex as unknown when device is runtime suspended hwmon: (k10temp) Add AMD family 17h model 60h PCI match EDAC/amd64: Add AMD family 17h model 60h PCI IDs power: vexpress: add suppress_bind_attrs to true power: supply: core: fix HWMON temperature labels power: supply: core: fix memory leak in HWMON error path pinctrl: samsung: Correct setting of eint wakeup mask on s5pv210 pinctrl: samsung: Save/restore eint_mask over suspend for EINT_TYPE GPIOs gnss: sirf: fix error return code in sirf_probe() sparc32: fix register window handling in genregs32_[gs]et() sparc64: fix misuses of access_process_vm() in genregs32_[sg]et() dm crypt: avoid truncating the logical block size alpha: fix memory barriers so that they conform to the specification powerpc/fadump: use static allocation for reserved memory ranges powerpc/fadump: consider reserved ranges while reserving memory powerpc/fadump: Account for memory_limit while reserving memory kernel/cpu_pm: Fix uninitted local in cpu_pm ARM: tegra: Correct PL310 Auxiliary Control Register initialization soc/tegra: pmc: Select GENERIC_PINCONF ARM: dts: exynos: Fix GPIO polarity for thr GalaxyS3 CM36651 sensor's bus ARM: dts: at91: sama5d2_ptc_ek: fix vbus pin ARM: dts: s5pv210: Set keep-power-in-suspend for SDHCI1 on Aries drivers/macintosh: Fix memleak in windfarm_pm112 driver powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUG powerpc/kasan: Fix issues by lowering KASAN_SHADOW_END powerpc/kasan: Fix shadow pages allocation failure powerpc/32: Disable KASAN with pages bigger than 16k powerpc/64s: Don't let DT CPU features set FSCR_DSCR powerpc/64s: Save FSCR to init_task.thread.fscr after feature init kbuild: force to build vmlinux if CONFIG_MODVERSION=y sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations. sunrpc: clean up properly in gss_mech_unregister() mtd: rawnand: Fix nand_gpio_waitrdy() mtd: rawnand: onfi: Fix redundancy detection check mtd: rawnand: brcmnand: fix hamming oob layout mtd: rawnand: diskonchip: Fix the probe error path mtd: rawnand: sharpsl: Fix the probe error path mtd: rawnand: ingenic: Fix the probe error path mtd: rawnand: xway: Fix the probe error path mtd: rawnand: orion: Fix the probe error path mtd: rawnand: socrates: Fix the probe error path mtd: rawnand: oxnas: Fix the probe error path mtd: rawnand: sunxi: Fix the probe error path mtd: rawnand: plat_nand: Fix the probe error path mtd: rawnand: pasemi: Fix the probe error path mtd: rawnand: mtk: Fix the probe error path mtd: rawnand: tmio: Fix the probe error path w1: omap-hdq: cleanup to add missing newline for some dev_dbg f2fs: fix checkpoint=disable:%u%% perf probe: Do not show the skipped events perf probe: Fix to check blacklist address correctly perf probe: Check address correctness by map instead of _etext perf symbols: Fix debuginfo search for Ubuntu perf symbols: Fix kernel maps for kcore and eBPF Linux 5.4.48 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I9954fb3f08956419e8586bcb9078e604df207fb9	2020-06-22 11:43:59 +02:00
Pavel Tatashin	13ae9eaae0	mm: call cond_resched() from deferred_init_memmap() commit `da97f2d56b` upstream. Now that deferred pages are initialized with interrupts enabled we can replace touch_nmi_watchdog() with cond_resched(), as it was before `3a2d7fa8a3`. For now, we cannot do the same in deferred_grow_zone() as it is still initializes pages with interrupts disabled. This change fixes RCU problem described in https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000 [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1) [ 60.475000] Sending NMI from CPU 0 to CPUs 1: [ 1.760091] NMI backtrace for cpu 1 [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1 [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014 [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41 [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006 [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000 [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340 [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002 [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002 [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000 [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0 [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1.760091] Call Trace: [ 1.760091] deferred_init_pages+0x8f/0xbf [ 1.760091] deferred_init_memmap+0x184/0x29d [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba [ 1.760091] kthread+0x112/0x130 [ 1.760091] ? kthread_flush_work_fn+0x10/0x10 [ 1.760091] ret_from_fork+0x35/0x40 [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms Fixes: `3a2d7fa8a3` ("mm: disable interrupts while initializing deferred pages") Reported-by: Yiqian Wei <yiwei@redhat.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Sasha Levin <sashal@kernel.org> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-06-22 09:31:14 +02:00
Daniel Jordan	5386d93bc5	mm/pagealloc.c: call touch_nmi_watchdog() on max order boundaries in deferred init commit `117003c327` upstream. Patch series "initialize deferred pages with interrupts enabled", v4. Keep interrupts enabled during deferred page initialization in order to make code more modular and allow jiffies to update. Original approach, and discussion can be found here: http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com This patch (of 3): deferred_init_memmap() disables interrupts the entire time, so it calls touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it will run with interrupts enabled, at which point cond_resched() should be used instead. deferred_grow_zone() makes the same watchdog calls through code shared with deferred init but will continue to run with interrupts disabled, so it can't call cond_resched(). Pull the watchdog calls up to these two places to allow the first to be changed later, independently of the second. The frequency reduces from twice per pageblock (init and free) to once per max order block. Fixes: `3a2d7fa8a3` ("mm: disable interrupts while initializing deferred pages") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: James Morris <jmorris@namei.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Yiqian Wei <yiwei@redhat.com> Cc: <stable@vger.kernel.org> [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-06-22 09:31:14 +02:00
Pavel Tatashin	c388f173ed	mm: initialize deferred pages with interrupts enabled commit `3d060856ad` upstream. Initializing struct pages is a long task and keeping interrupts disabled for the duration of this operation introduces a number of problems. 1. jiffies are not updated for long period of time, and thus incorrect time is reported. See proposed solution and discussion here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com 2. It prevents farther improving deferred page initialization by allowing intra-node multi-threading. We are keeping interrupts disabled to solve a rather theoretical problem that was never observed in real world (See `3a2d7fa8a3`). Let's keep interrupts enabled. In case we ever encounter a scenario where an interrupt thread wants to allocate large amount of memory this early in boot we can deal with that by growing zone (see deferred_grow_zone()) by the needed amount before starting deferred_init_memmap() threads. Before: [ 1.232459] node 0 initialised, 12058412 pages in 1ms After: [ 1.632580] node 0 initialised, 12051227 pages in 436ms Fixes: `3a2d7fa8a3` ("mm: disable interrupts while initializing deferred pages") Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Sasha Levin <sashal@kernel.org> Cc: Yiqian Wei <yiwei@redhat.com> Cc: <stable@vger.kernel.org> [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-06-22 09:31:14 +02:00
Greg Kroah-Hartman	5e169f689f	Merge 5.4.41 into android-5.4-stable Changes in 5.4.41 USB: serial: qcserial: Add DW5816e support nvme: refactor nvme_identify_ns_descs error handling nvme: fix possible hang when ns scanning fails during error recovery tracing/kprobes: Fix a double initialization typo net: macb: Fix runtime PM refcounting drm/amdgpu: move kfd suspend after ip_suspend_phase1 drm/amdgpu: drop redundant cg/pg ungate on runpm enter vt: fix unicode console freeing with a common interface tty: xilinx_uartps: Fix missing id assignment to the console devlink: fix return value after hitting end in region read dp83640: reverse arguments to list_add_tail fq_codel: fix TCA_FQ_CODEL_DROP_BATCH_SIZE sanity checks ipv6: Use global sernum for dst validation with nexthop objects mlxsw: spectrum_acl_tcam: Position vchunk in a vregion list properly neigh: send protocol value in neighbor create notification net: dsa: Do not leave DSA master with NULL netdev_ops net: macb: fix an issue about leak related system resources net: macsec: preserve ingress frame ordering net/mlx4_core: Fix use of ENOSPC around mlx4_counter_alloc() net_sched: sch_skbprio: add message validation to skbprio_change() net: stricter validation of untrusted gso packets net: tc35815: Fix phydev supported/advertising mask net/tls: Fix sk_psock refcnt leak in bpf_exec_tx_verdict() net/tls: Fix sk_psock refcnt leak when in tls_data_ready() net: usb: qmi_wwan: add support for DW5816e nfp: abm: fix a memory leak bug sch_choke: avoid potential panic in choke_reset() sch_sfq: validate silly quantum values tipc: fix partial topology connection closure tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040 bnxt_en: Fix VF anti-spoof filter setup. bnxt_en: Reduce BNXT_MSIX_VEC_MAX value to supported CQs per PF. bnxt_en: Improve AER slot reset. bnxt_en: Return error when allocating zero size context memory. bnxt_en: Fix VLAN acceleration handling in bnxt_fix_features(). net/mlx5: DR, On creation set CQ's arm_db member to right value net/mlx5: Fix forced completion access non initialized command entry net/mlx5: Fix command entry leak in Internal Error State net: mvpp2: prevent buffer overflow in mvpp22_rss_ctx() net: mvpp2: cls: Prevent buffer overflow in mvpp2_ethtool_cls_rule_del() HID: wacom: Read HID_DG_CONTACTMAX directly for non-generic devices sctp: Fix bundling of SHUTDOWN with COOKIE-ACK Revert "HID: wacom: generic: read the number of expected touches on a per collection basis" HID: usbhid: Fix race between usbhid_close() and usbhid_stop() HID: wacom: Report 2nd-gen Intuos Pro S center button status over BT USB: uas: add quirk for LaCie 2Big Quadra usb: chipidea: msm: Ensure proper controller reset using role switch API USB: serial: garmin_gps: add sanity checking for data length tracing: Add a vmalloc_sync_mappings() for safe measure crypto: arch/nhpoly1305 - process in explicit 4k chunks KVM: s390: Remove false WARN_ON_ONCE for the PQAP instruction KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB path KVM: arm: vgic: Fix limit condition when writing to GICD_I[CS]ACTIVER KVM: arm64: Fix 32bit PC wrap-around arm64: hugetlb: avoid potential NULL dereference drm: ingenic-drm: add MODULE_DEVICE_TABLE ipc/mqueue.c: change __do_notify() to bypass check_kill_permission() epoll: atomically remove wait entry on wake up eventpoll: fix missing wakeup for ovflist in ep_poll_callback mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() mm: limit boost_watermark on small zones ceph: fix endianness bug when handling MDS session feature bits ceph: demote quotarealm lookup warning to a debug message staging: gasket: Check the return value of gasket_get_bar_index() coredump: fix crash when umh is disabled riscv: set max_pfn to the PFN of the last page iocost: protect iocg->abs_vdebt with iocg->waitq.lock batman-adv: fix batadv_nc_random_weight_tq batman-adv: Fix refcnt leak in batadv_show_throughput_override batman-adv: Fix refcnt leak in batadv_store_throughput_override batman-adv: Fix refcnt leak in batadv_v_ogm_process x86/entry/64: Fix unwind hints in register clearing code x86/entry/64: Fix unwind hints in kernel exit path x86/entry/64: Fix unwind hints in rewind_stack_do_exit() x86/unwind/orc: Don't skip the first frame for inactive tasks x86/unwind/orc: Prevent unwinding before ORC initialization x86/unwind/orc: Fix error path for bad ORC entry type x86/unwind/orc: Fix premature unwind stoppage due to IRET frames KVM: x86: Fixes posted interrupt check for IRQs delivery modes arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory() netfilter: nat: never update the UDP checksum when it's 0 netfilter: nf_osf: avoid passing pointer to local var objtool: Fix stack offset tracking for indirect CFAs iommu/virtio: Reverse arguments to list_add scripts/decodecode: fix trapping instruction formatting mm, memcg: fix error return value of mem_cgroup_css_alloc() bdi: move bdi_dev_name out of line bdi: add a ->dev_name field to struct backing_dev_info fsnotify: replace inode pointer with an object id fanotify: merge duplicate events on parent and child Linux 5.4.41 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ie6695b1dace8ca62579a57084608e9268e52fde9	2020-05-14 08:55:48 +02:00
Henry Willard	e991f7ded4	mm: limit boost_watermark on small zones commit `14f69140ff` upstream. Commit `1c30844d2d` ("mm: reclaim small amounts of memory when an external fragmentation event occurs") adds a boost_watermark() function which increases the min watermark in a zone by at least pageblock_nr_pages or the number of pages in a page block. On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or 512M. It does this regardless of the number of managed pages managed in the zone or the likelihood of success. This can put the zone immediately under water in terms of allocating pages from the zone, and can cause a small machine to fail immediately due to OoM. Unlike set_recommended_min_free_kbytes(), which substantially increases min_free_kbytes and is tied to THP, boost_watermark() can be called even if THP is not active. The problem is most likely to appear on architectures such as Arm64 where pageblock_nr_pages is very large. It is desirable to run the kdump capture kernel in as small a space as possible to avoid wasting memory. In some architectures, such as Arm64, there are restrictions on where the capture kernel can run, and therefore, the space available. A capture kernel running in 768M can fail due to OoM immediately after boost_watermark() sets the min in zone DMA32, where most of the memory is, to 512M. It fails even though there is over 500M of free memory. With boost_watermark() suppressed, the capture kernel can run successfully in 448M. This patch limits boost_watermark() to boosting a zone's min watermark only when there are enough pages that the boost will produce positive results. In this case that is estimated to be four times as many pages as pageblock_nr_pages. Mel said: : There is no harm in marking it stable. Clearly it does not happen very : often but it's not impossible. 32-bit x86 is a lot less common now : which would previously have been vulnerable to triggering this easily. : ppc64 has a larger base page size but typically only has one zone. : arm64 is likely the most vulnerable, particularly when CMA is : configured with a small movable zone. Fixes: `1c30844d2d` ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Henry Willard <henry.willard@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-05-14 07:58:26 +02:00
David Hildenbrand	4b49a9660d	mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() commit `e84fe99b68` upstream. Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up. watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: __pageblock_pfn_to_page+0x134/0x1c0 Call Trace: set_zone_contiguous+0x56/0x70 page_alloc_init_late+0x166/0x176 kernel_init_freeable+0xfa/0x255 kernel_init+0xa/0x106 ret_from_fork+0x35/0x40 The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-05-14 07:58:26 +02:00
Liam Mark	dd1c26b853	ANDROID: GKI: mm: add cma pcp list Add a cma pcp list in order to increase cma memory utilization. Increased cma memory utilization will improve overall memory utilization because free cma pages are ignored when memory reclaim is done with gfp mask GFP_KERNEL. Since most memory reclaim is done by kswapd, which uses a gfp mask of GFP_KERNEL, by increasing cma memory utilization we are therefore ensuring that less aggressive memory reclaim takes place. Increased cma memory utilization will improve performance, for example it will increase app concurrency. Change-Id: I809589a25c6abca51f1c963f118adfc78e955cf9 Signed-off-by: Liam Mark <lmark@codeaurora.org> [vinmenon@codeaurora.org: fix !CONFIG_CMA compile time issues] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> [swatsrid@codeaurora.org: Fix merge conflicts] Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org> (cherry picked from commit 0caf6be3e0842302ecd54c9e75943ca4133f4c7a) Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 150378964 Bug: 142290962 Signed-off-by: Todd Kjos <tkjos@google.com>	2020-04-03 14:52:15 -07:00
Mark Salyzyn	b9d3d8f1e9	ANDROID: GKI: cma: redirect page allocation to CMA CMA pages are designed to be used as fallback for movable allocations and cannot be used for non-movable allocations. If CMA pages are utilized poorly, non-movable allocations may end up getting starved if all regular movable pages are allocated and the only pages left are CMA. Always using CMA pages first creates unacceptable performance problems. As a midway alternative, use CMA pages for certain userspace allocations. The userspace pages can be migrated or dropped quickly which giving decent utilization. Change-Id: I6165dda01b705309eebabc6dfa67146b7a95c174 Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Heesub Shin <heesub.shin@samsung.com> [lauraa@codeaurora.org: Missing CONFIG_CMA guards, add commit text] Signed-off-by: Laura Abbott <lauraa@codeaurora.org> [lmark@codeaurora.org: resolve conflicts relating to MIGRATE_HIGHATOMIC] Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> [swatsrid@codeaurora.org: Fix merge conflicts] Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org> (cherry picked from commit 46f8fca539686ce8493ff82206f9de2d07c9d72c) Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 150378964 Bug: 142290962 [tkjos@google.com: ANDROID: Fix kernelci build-break on !CONFIG_CMA builds] Signed-off-by: Todd Kjos <tkjos@google.com>	2020-04-03 14:51:35 -07:00
Saravana Kannan	62a48a7e90	ANDROID: GKI: Revert "mm: unexport free_reserved_area" This reverts commit `afa0011289`. Vendors would still like to use the free_reserved_area() API. Bug: 146846822 [saravanak Changed EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL()] Signed-off-by: Saravana Kannan <saravanak@google.com> Change-Id: Ic926e79270203cc0614b913d0bfd88aef093cdeb	2020-03-17 22:16:00 +00:00
Greg Kroah-Hartman	87acfa0267	Merge 5.4.19 into android-5.4 Changes in 5.4.19 sparc32: fix struct ipc64_perm type definition bnxt_en: Move devlink_register before registering netdev cls_rsvp: fix rsvp_policy gtp: use __GFP_NOWARN to avoid memalloc warning l2tp: Allow duplicate session creation with UDP net: hsr: fix possible NULL deref in hsr_handle_frame() net_sched: fix an OOB access in cls_tcindex net: stmmac: Delete txtimer in suspend() bnxt_en: Fix TC queue mapping. rxrpc: Fix use-after-free in rxrpc_put_local() rxrpc: Fix insufficient receive notification generation rxrpc: Fix missing active use pinning of rxrpc_local object rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect tcp: clear tp->total_retrans in tcp_disconnect() tcp: clear tp->delivered in tcp_disconnect() tcp: clear tp->data_segs{in\|out} in tcp_disconnect() tcp: clear tp->segs_{in\|out} in tcp_disconnect() ionic: fix rxq comp packet type mask MAINTAINERS: correct entries for ISDN/mISDN section netdevsim: fix stack-out-of-bounds in nsim_dev_debugfs_init() bnxt_en: Fix logic that disables Bus Master during firmware reset. media: uvcvideo: Avoid cyclic entity chains due to malformed USB descriptors mfd: dln2: More sanity checking for endpoints netfilter: ipset: fix suspicious RCU usage in find_set_and_id ipc/msg.c: consolidate all xxxctl_down() functions tracing/kprobes: Have uname use __get_str() in print_fmt tracing: Fix sched switch start/stop refcount racy updates rcu: Use _ONCE() to protect lockless ->expmask accesses rcu: Avoid data-race in rcu_gp_fqs_check_wake() srcu: Apply _ONCE() to ->srcu_last_gp_end rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special() nvmet: Fix error print message at nvmet_install_queue function nvmet: Fix controller use after free Bluetooth: btusb: fix memory leak on fw Bluetooth: btusb: Disable runtime suspend on Realtek devices brcmfmac: Fix memory leak in brcmf_usbdev_qinit usb: dwc3: gadget: Check END_TRANSFER completion usb: dwc3: gadget: Delay starting transfer usb: typec: tcpci: mask event interrupts when remove driver objtool: Silence build output usb: gadget: f_fs: set req->num_sgs as 0 for non-sg transfer usb: gadget: legacy: set max_speed to super-speed usb: gadget: f_ncm: Use atomic_t to track in-flight request usb: gadget: f_ecm: Use atomic_t to track in-flight request ALSA: usb-audio: Fix endianess in descriptor validation ALSA: usb-audio: Annotate endianess in Scarlett gen2 quirk ALSA: dummy: Fix PCM format loop in proc output memcg: fix a crash in wb_workfn when a device disappears mm/sparse.c: reset section's mem_map when fully deactivated mmc: sdhci-pci: Make function amd_sdhci_reset static utimes: Clamp the timestamps in notify_change() mm/memory_hotplug: fix remove_memory() lockdep splat mm: thp: don't need care deferred split queue in memcg charge move path mm: move_pages: report the number of non-attempted pages media/v4l2-core: set pages dirty upon releasing DMA buffers media: v4l2-core: compat: ignore native command codes media: v4l2-rect.h: fix v4l2_rect_map_inside() top/left adjustments lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more() irqdomain: Fix a memory leak in irq_domain_push_irq() x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR platform/x86: intel_scu_ipc: Fix interrupt support ALSA: hda: Apply aligned MMIO access only conditionally ALSA: hda: Add Clevo W65_67SB the power_save blacklist ALSA: hda: Add JasperLake PCI ID and codec vid arm64: acpi: fix DAIF manipulation with pNMI KVM: arm64: Correct PSTATE on exception entry KVM: arm/arm64: Correct CPSR on exception entry KVM: arm/arm64: Correct AArch32 SPSR on exception entry KVM: arm64: Only sign-extend MMIO up to register width MIPS: syscalls: fix indentation of the 'SYSNR' message MIPS: fix indentation of the 'RELOCS' message MIPS: boot: fix typo in 'vmlinux.lzma.its' target s390/mm: fix dynamic pagetable upgrade for hugetlbfs powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case powerpc/ptdump: Fix W+X verification powerpc/xmon: don't access ASDR in VMs powerpc/pseries: Advance pfn if section is not present in lmb_is_removable() powerpc/32s: Fix bad_kuap_fault() powerpc/32s: Fix CPU wake-up from sleep mode tracing: Fix now invalid var_ref_vals assumption in trace action PCI: tegra: Fix return value check of pm_runtime_get_sync() PCI: keystone: Fix outbound region mapping PCI: keystone: Fix link training retries initiation PCI: keystone: Fix error handling when "num-viewport" DT property is not populated mmc: spi: Toggle SPI polarity, do not hardcode it ACPI: video: Do not export a non working backlight interface on MSI MS-7721 boards ACPI / battery: Deal with design or full capacity being reported as -1 ACPI / battery: Use design-cap for capacity calculations if full-cap is not available ACPI / battery: Deal better with neither design nor full capacity not being reported alarmtimer: Unregister wakeup source when module get fails fscrypt: don't print name of busy file when removing key ubifs: don't trigger assertion on invalid no-key filename ubifs: Fix wrong memory allocation ubifs: Fix FS_IOC_SETFLAGS unexpectedly clearing encrypt flag ubifs: Fix deadlock in concurrent bulk-read and writepage mmc: sdhci-of-at91: fix memleak on clk_get failure ASoC: SOF: core: free trace on errors hv_balloon: Balloon up according to request page number mfd: axp20x: Mark AXP20X_VBUS_IPSOUT_MGMT as volatile nvmem: core: fix memory abort in cleanup path crypto: api - Check spawn->alg under lock in crypto_drop_spawn crypto: ccree - fix backlog memory leak crypto: ccree - fix AEAD decrypt auth fail crypto: ccree - fix pm wrongful error reporting crypto: ccree - fix FDE descriptor sequence crypto: ccree - fix PM race condition padata: Remove broken queue flushing fs: allow deduplication of eof block into the end of the destination file scripts/find-unused-docs: Fix massive false positives erofs: fix out-of-bound read for shifted uncompressed block scsi: megaraid_sas: Do not initiate OCR if controller is not in ready state scsi: qla2xxx: Fix mtcp dump collection failure cpupower: Revert library ABI changes from commit `ae2917093f` power: supply: axp20x_ac_power: Fix reporting online status power: supply: ltc2941-battery-gauge: fix use-after-free ovl: fix wrong WARN_ON() in ovl_cache_update_ino() ovl: fix lseek overflow on 32bit f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project() f2fs: fix miscounted block limit in f2fs_statfs_project() f2fs: code cleanup for f2fs_statfs_project() f2fs: fix dcache lookup of !casefolded directories f2fs: fix race conditions in ->d_compare() and ->d_hash() PM: core: Fix handling of devices deleted during system-wide resume cpufreq: Avoid creating excessively large stack frames of: Add OF_DMA_DEFAULT_COHERENT & select it on powerpc ARM: dma-api: fix max_pfn off-by-one error in __dma_supported() dm zoned: support zone sizes smaller than 128MiB dm space map common: fix to ensure new block isn't already in use dm writecache: fix incorrect flush sequence when doing SSD mode commit dm crypt: fix GFP flags passed to skcipher_request_alloc() dm crypt: fix benbi IV constructor crash if used in authenticated mode dm thin metadata: use pool locking at end of dm_pool_metadata_close dm: fix potential for q->make_request_fn NULL pointer scsi: qla2xxx: Fix stuck login session using prli_pend_timer ASoC: SOF: Introduce state machine for FW boot ASoC: SOF: core: release resources on errors in probe_continue tracing: Annotate ftrace_graph_hash pointer with __rcu tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu ftrace: Add comment to why rcu_dereference_sched() is open coded ftrace: Protect ftrace_graph_hash with ftrace_sync crypto: pcrypt - Avoid deadlock by using per-instance padata queues btrfs: fix improper setting of scanned for range cyclic write cache pages btrfs: Handle another split brain scenario with metadata uuid feature riscv, bpf: Fix broken BPF tail calls selftests/bpf: Fix perf_buffer test on systems w/ offline CPUs bpf, devmap: Pass lockdep expression to RCU lists libbpf: Fix realloc usage in bpf_core_find_cands tc-testing: fix eBPF tests failure on linux fresh clones samples/bpf: Don't try to remove user's homedir on clean samples/bpf: Xdp_redirect_cpu fix missing tracepoint attach selftests/bpf: Fix test_attach_probe selftests/bpf: Skip perf hw events test if the setup disabled it selftests: bpf: Use a temporary file in test_sockmap selftests: bpf: Ignore FIN packets for reuseport tests crypto: api - fix unexpectedly getting generic implementation crypto: hisilicon - Use the offset fields in sqe to avoid need to split scatterlists crypto: ccp - set max RSA modulus size for v3 platform devices as well crypto: arm64/ghash-neon - bump priority to 150 crypto: pcrypt - Do not clear MAY_SLEEP flag in original request crypto: atmel-aes - Fix counter overflow in CTR mode crypto: api - Fix race condition in crypto_spawn_alg crypto: picoxcell - adjust the position of tasklet_init and fix missed tasklet_kill powerpc/futex: Fix incorrect user access blocking scsi: qla2xxx: Fix unbound NVME response length NFS: Fix memory leaks and corruption in readdir NFS: Directory page cache pages need to be locked when read nfsd: fix filecache lookup jbd2_seq_info_next should increase position index ext4: fix deadlock allocating crypto bounce page from mempool ext4: fix race conditions in ->d_compare() and ->d_hash() Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES Btrfs: make deduplication with range including the last block work Btrfs: fix infinite loop during fsync after rename operations btrfs: set trans->drity in btrfs_commit_transaction btrfs: drop log root for dropped roots Btrfs: fix race between adding and putting tree mod seq elements and nodes btrfs: flush write bio if we loop in extent_write_cache_pages btrfs: Correctly handle empty trees in find_first_clear_extent_bit ARM: tegra: Enable PLLP bypass during Tegra124 LP1 iwlwifi: don't throw error when trying to remove IGTK mwifiex: fix unbalanced locking in mwifiex_process_country_ie() sunrpc: expiry_time should be seconds not timeval gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0 gfs2: move setting current->backing_dev_info gfs2: fix O_SYNC write handling drm: atmel-hlcdc: use double rate for pixel clock only if supported drm: atmel-hlcdc: enable clock before configuring timing engine drm: atmel-hlcdc: prefer a lower pixel-clock than requested drm/rect: Avoid division by zero media: iguanair: fix endpoint sanity check media: rc: ensure lirc is initialized before registering input device tools/kvm_stat: Fix kvm_exit filter name xen/balloon: Support xend-based toolstack take two watchdog: fix UAF in reboot notifier handling in watchdog core code bcache: add readahead cache policy options via sysfs interface eventfd: track eventfd_signal() recursion depth aio: prevent potential eventfd recursion on poll KVM: x86: Refactor picdev_write() to prevent Spectre-v1/L1TF attacks KVM: x86: Refactor prefix decoding to prevent Spectre-v1/L1TF attacks KVM: x86: Protect pmu_intel.c from Spectre-v1/L1TF attacks KVM: x86: Protect DR-based index computations from Spectre-v1/L1TF attacks KVM: x86: Protect kvm_lapic_reg_write() from Spectre-v1/L1TF attacks KVM: x86: Protect kvm_hv_msr_[get\|set]_crash_data() from Spectre-v1/L1TF attacks KVM: x86: Protect ioapic_write_indirect() from Spectre-v1/L1TF attacks KVM: x86: Protect MSR-based index computations in pmu.h from Spectre-v1/L1TF attacks KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks KVM: x86: Protect MSR-based index computations from Spectre-v1/L1TF attacks in x86.c KVM: x86: Protect x86_decode_insn from Spectre-v1/L1TF attacks KVM: x86: Protect MSR-based index computations in fixed_msr_to_seg_unit() from Spectre-v1/L1TF attacks KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platform KVM: PPC: Book3S HV: Uninit vCPU if vcore creation fails KVM: PPC: Book3S PR: Free shared page if mmu initialization fails kvm/svm: PKU not currently supported x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit x86/kvm: Introduce kvm_(un)map_gfn() x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed x86/kvm: Cache gfn to pfn translation x86/KVM: Clean up host's steal time structure KVM: VMX: Add non-canonical check on writes to RTIT address MSRs KVM: x86: Don't let userspace set host-reserved cr4 bits KVM: x86: Free wbinvd_dirty_mask if vCPU creation fails KVM: x86: Handle TIF_NEED_FPU_LOAD in kvm_{load,put}_guest_fpu() KVM: x86: Ensure guest's FPU state is loaded when accessing for emulation KVM: x86: Revert "KVM: X86: Fix fpu state crash in kvm guest" KVM: s390: do not clobber registers during guest reset/store status ocfs2: fix oops when writing cloned file mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section arm64: dts: qcom: qcs404-evb: Set vdd_apc regulator in high power mode mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush clk: tegra: Mark fuse clock as critical drm/amd/dm/mst: Ignore payload update failures virtio-balloon: initialize all vq callbacks virtio-pci: check name when counting MSI-X vectors fix up iter on short count in fuse_direct_io() broken ping to ipv6 linklocal addresses on debian buster percpu: Separate decrypted varaibles anytime encryption can be enabled ASoC: meson: axg-fifo: fix fifo threshold setup scsi: qla2xxx: Fix the endianness of the qla82xx_get_fw_size() return type scsi: csiostor: Adjust indentation in csio_device_reset scsi: qla4xxx: Adjust indentation in qla4xxx_mem_free scsi: ufs: Recheck bkops level if bkops is disabled mtd: spi-nor: Split mt25qu512a (n25q512a) entry into two phy: qualcomm: Adjust indentation in read_poll_timeout ext2: Adjust indentation in ext2_fill_super powerpc/44x: Adjust indentation in ibm4xx_denali_fixup_memsize drm: msm: mdp4: Adjust indentation in mdp4_dsi_encoder_enable NFC: pn544: Adjust indentation in pn544_hci_check_presence ppp: Adjust indentation into ppp_async_input net: smc911x: Adjust indentation in smc911x_phy_configure net: tulip: Adjust indentation in {dmfe, uli526x}_init_module IB/mlx5: Fix outstanding_pi index for GSI qps IB/core: Fix ODP get user pages flow nfsd: fix delay timer on 32-bit architectures nfsd: fix jiffies/time_t mixup in LRU list nfsd: Return the correct number of bytes written to the file virtio-balloon: Fix memory leak when unloading while hinting is in progress virtio_balloon: Fix memory leaks on errors in virtballoon_probe() ubi: fastmap: Fix inverted logic in seen selfcheck ubi: Fix an error pointer dereference in error handling code ubifs: Fix memory leak from c->sup_node regulator: core: Add regulator_is_equal() helper ASoC: sgtl5000: Fix VDDA and VDDIO comparison bonding/alb: properly access headers in bond_alb_xmit() devlink: report 0 after hitting end in region read dpaa_eth: support all modes with rate adapting PHYs net: dsa: b53: Always use dev->vlan_enabled in b53_configure_vlan() net: dsa: bcm_sf2: Only 7278 supports 2Gb/sec IMP port net: dsa: microchip: enable module autoprobe net: mvneta: move rx_dropped and rx_errors in per-cpu stats net_sched: fix a resource leak in tcindex_set_parms() net: stmmac: fix a possible endless loop net: systemport: Avoid RBUF stuck in Wake-on-LAN mode net/mlx5: IPsec, Fix esp modify function attribute net/mlx5: IPsec, fix memory leak at mlx5_fpga_ipsec_delete_sa_ctx net: macb: Remove unnecessary alignment check for TSO net: macb: Limit maximum GEM TX length in TSO taprio: Fix enabling offload with wrong number of traffic classes taprio: Fix still allowing changing the flags during runtime taprio: Add missing policy validation for flags taprio: Use taprio_reset_tc() to reset Traffic Classes configuration taprio: Fix dropping packets when using taprio + ETF offloading ipv6/addrconf: fix potential NULL deref in inet6_set_link_af() qed: Fix timestamping issue for L2 unicast ptp packets. drop_monitor: Do not cancel uninitialized work item net/mlx5: Fix deadlock in fs_core net/mlx5: Deprecate usage of generic TLS HW capability bit ASoC: Intel: skl_hda_dsp_common: Fix global-out-of-bounds bug mfd: da9062: Fix watchdog compatible string mfd: rn5t618: Mark ADC control register volatile mfd: bd70528: Fix hour register mask x86/timer: Don't skip PIT setup when APIC is disabled or in legacy mode btrfs: use bool argument in free_root_pointers() btrfs: free block groups after free'ing fs trees drm/dp_mst: Remove VCPI while disabling topology mgr KVM: x86/mmu: Apply max PA check for MMIO sptes to 32-bit KVM KVM: x86: use CPUID to locate host page table reserved bits KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM KVM: x86: fix overlap between SPTE_MMIO_MASK and generation KVM: nVMX: vmread should not set rflags to specify success in case of #PF KVM: Use vcpu-specific gva->hva translation when querying host page size KVM: Play nice with read-only memslots when querying host page size cifs: fail i/o on soft mounts if sessionsetup errors out x86/apic/msi: Plug non-maskable MSI affinity race clocksource: Prevent double add_timer_on() for watchdog_timer perf/core: Fix mlock accounting in perf_mmap() rxrpc: Fix service call disconnection regulator fix for "regulator: core: Add regulator_is_equal() helper" powerpc/kuap: Fix set direction in allow/prevent_user_access() Linux 5.4.19 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ief6bae336b8e6931810e5b357c0d5e16fbf1c13e	2020-02-11 14:09:41 -08:00
David Hildenbrand	ed53278ee8	mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section commit `e822969cab` upstream. Patch series "mm: fix max_pfn not falling on section boundary", v2. Playing with different memory sizes for a x86-64 guest, I discovered that some memmaps (highest section if max_mem does not fall on the section boundary) are marked as being valid and online, but contain garbage. We have to properly initialize these memmaps. Looking at /proc/kpageflags and friends, I found some more issues, partially related to this. This patch (of 3): If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (esp., one of the main reasons sub-section hotadd of devmem was added). The issue is, that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online". pfn_to_online_page() will succeed, but the memmap contains garbage. E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m 4160M" - (see tools/vm/page-types.c): [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe [ 200.477500] #PF: supervisor read access in kernel mode [ 200.478334] #PF: error_code(0x0000) - not-present page [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0 [ 200.479557] Oops: 0000 [#4] SMP NOPTI [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93 [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410 [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202 [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000 [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246 [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000 [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001 [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08 [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000 [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0 [ 200.488897] Call Trace: [ 200.489115] kpageflags_read+0xe9/0x140 [ 200.489447] proc_reg_read+0x3c/0x60 [ 200.489755] vfs_read+0xc2/0x170 [ 200.490037] ksys_pread64+0x65/0xa0 [ 200.490352] do_syscall_64+0x5c/0xa0 [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe But it can be triggered much easier via "cat /proc/kpageflags > /dev/null" after cold/hot plugging a DIMM to such a system: [root@localhost ~]# cat /proc/kpageflags > /dev/null [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe [ 111.517907] #PF: supervisor read access in kernel mode [ 111.518333] #PF: error_code(0x0000) - not-present page [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0 This patch fixes that by at least zero-ing out that memmap (so e.g., page_to_pfn() will not crash). Commit `907ec5fca3` ("mm: zero remaining unavailable struct pages") tried to fix a similar issue, but forgot to consider this special case. After this patch, there are still problems to solve. E.g., not all of these pages falling into a memory hole will actually get initialized later and set PageReserved - they are only zeroed out - but at least the immediate crashes are gone. A follow-up patch will take care of this. Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com Fixes: `f7f99100d8` ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: <stable@vger.kernel.org> [4.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-02-11 04:35:42 -08:00
Greg Kroah-Hartman	ea962facf5	Merge 5.4.14 into android-5.4 Changes in 5.4.14 ARM: dts: meson8: fix the size of the PMU registers clk: qcom: gcc-sdm845: Add missing flag to votable GDSCs soc: amlogic: meson-ee-pwrc: propagate PD provider registration errors soc: amlogic: meson-ee-pwrc: propagate errors from pm_genpd_init() dt-bindings: reset: meson8b: fix duplicate reset IDs ARM: dts: imx6q-dhcom: fix rtc compatible arm64: dts: ls1028a: fix endian setting for dcfg arm64: dts: imx8mm: Change SDMA1 ahb clock for imx8mm bus: ti-sysc: Fix iterating over clocks clk: Don't try to enable critical clocks if prepare failed Revert "gpio: thunderx: Switch to GPIOLIB_IRQCHIP" arm64: dts: imx8mq-librem5-devkit: use correct interrupt for the magnetometer ASoC: msm8916-wcd-digital: Reset RX interpolation path after use ASoC: stm32: sai: fix possible circular locking ASoC: stm32: dfsdm: fix 16 bits record ASoC: msm8916-wcd-analog: Fix selected events for MIC BIAS External1 ASoC: msm8916-wcd-analog: Fix MIC BIAS Internal1 ARM: OMAP2+: Fix ti_sysc_find_one_clockdomain to check for to_clk_hw_omap ARM: dts: imx7ulp: fix reg of cpu node ARM: dts: imx6q-dhcom: Fix SGTL5000 VDDIO regulator connection ASoC: Intel: bytcht_es8316: Fix Irbis NB41 netbook quirk ALSA: dice: fix fallback from protocol extension into limited functionality ALSA: seq: Fix racy access for queue timer in proc read ALSA: firewire-tascam: fix corruption due to spin lock without restoration in SoftIRQ context ALSA: usb-audio: fix sync-ep altsetting sanity check arm64: dts: allwinner: a64: olinuxino: Fix SDIO supply regulator arm64: dts: allwinner: a64: olinuxino: Fix eMMC supply regulator arm64: dts: agilex/stratix10: fix pmu interrupt numbers Fix built-in early-load Intel microcode alignment clk: sunxi-ng: r40: Allow setting parent rate for external clock outputs block: fix an integer overflow in logical block size fuse: fix fuse_send_readpages() in the syncronous read case io_uring: only allow submit from owning task cpuidle: teo: Fix intervals[] array indexing bug ARM: dts: am571x-idk: Fix gpios property to have the correct gpio number ARM: davinci: select CONFIG_RESET_CONTROLLER perf: Correctly handle failed perf_get_aux_event() iio: adc: ad7124: Fix DT channel configuration iio: imu: st_lsm6dsx: Fix selection of ST_LSM6DS3_ID iio: light: vcnl4000: Fix scale for vcnl4040 iio: chemical: pms7003: fix unmet triggered buffer dependency iio: buffer: align the size of scan bytes to size of the largest element USB: serial: simple: Add Motorola Solutions TETRA MTP3xxx and MTP85xx USB: serial: option: Add support for Quectel RM500Q USB: serial: opticon: fix control-message timeouts USB: serial: option: add support for Quectel RM500Q in QDL mode USB: serial: suppress driver bind attributes USB: serial: ch341: handle unbound port at reset_resume USB: serial: io_edgeport: handle unbound ports on URB completion USB: serial: io_edgeport: add missing active-port sanity check USB: serial: keyspan: handle unbound ports USB: serial: quatech2: handle unbound ports staging: comedi: ni_routes: fix null dereference in ni_find_route_source() staging: comedi: ni_routes: allow partial routing information scsi: fnic: fix invalid stack access scsi: mptfusion: Fix double fetch bug in ioctl ptrace: reintroduce usage of subjective credentials in ptrace_has_cap() mtd: rawnand: gpmi: Fix suspend/resume problem mtd: rawnand: gpmi: Restore nfc timing setup after suspend/resume usb: core: hub: Improved device recognition on remote wakeup cpu/SMT: Fix x86 link error without CONFIG_SYSFS x86/resctrl: Fix an imbalance in domain_remove_cpu() x86/CPU/AMD: Ensure clearing of SME/SEV features is maintained locking/rwsem: Fix kernel crash when spinning on RWSEM_OWNER_UNKNOWN perf/x86/intel/uncore: Fix missing marker for snr_uncore_imc_freerunning_events x86/efistub: Disable paging at mixed mode entry s390/zcrypt: Fix CCA cipher key gen with clear key value function scsi: storvsc: Correctly set number of hardware queues for IDE disk mtd: spi-nor: Fix selection of 4-byte addressing opcodes on Spansion drm/i915: Add missing include file <linux/math64.h> x86/resctrl: Fix potential memory leak efi/earlycon: Fix write-combine mapping on x86 s390/setup: Fix secure ipl message clk: samsung: exynos5420: Keep top G3D clocks enabled perf hists: Fix variable name's inconsistency in hists__for_each() macro locking/lockdep: Fix buffer overrun problem in stack_trace[] perf report: Fix incorrectly added dimensions as switch perf data file mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment mm: memcg/slab: fix percpu slab vmstats flushing mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid mm, debug_pagealloc: don't rely on static keys too early btrfs: rework arguments of btrfs_unlink_subvol btrfs: fix invalid removal of root ref btrfs: do not delete mismatched root refs btrfs: relocation: fix reloc_root lifespan and access btrfs: fix memory leak in qgroup accounting btrfs: check rw_devices, not num_devices for balance Btrfs: always copy scrub arguments back to user space mm/memory_hotplug: don't free usage map when removing a re-added early section mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio() mm: khugepaged: add trace status description for SCAN_PAGE_HAS_PRIVATE ARM: dts: imx6qdl-sabresd: Remove incorrect power supply assignment ARM: dts: imx6sx-sdb: Remove incorrect power supply assignment ARM: dts: imx6sl-evk: Remove incorrect power supply assignment ARM: dts: imx6sll-evk: Remove incorrect power supply assignment ARM: dts: imx6q-icore-mipi: Use 1.5 version of i.Core MX6DL ARM: dts: imx7: Fix Toradex Colibri iMX7S 256MB NAND flash support net: stmmac: 16KB buffer must be 16 byte aligned net: stmmac: Enable 16KB buffer size reset: Fix {of,devm}_reset_control_array_get kerneldoc return types tipc: fix potential hanging after b/rcast changing tipc: fix retrans failure due to wrong destination net: fix kernel-doc warning in <linux/netdevice.h> block: Fix the type of 'sts' in bsg_queue_rq() drm/amd/display: Reorder detect_edp_sink_caps before link settings read. bpf: Fix incorrect verifier simulation of ARSH under ALU32 bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop bpf: Sockmap, ensure sock lock held during tear down bpf: Sockmap/tls, push write_space updates through ulp updates bpf: Sockmap, skmsg helper overestimates push, pull, and pop bounds bpf: Sockmap/tls, msg_push_data may leave end mark in place bpf: Sockmap/tls, tls_sw can create a plaintext buf > encrypt buf bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining bpf: Sockmap/tls, fix pop data with SK_DROP return code i2c: tegra: Fix suspending in active runtime PM state i2c: tegra: Properly disable runtime PM on driver's probe error cfg80211: fix deadlocks in autodisconnect work cfg80211: fix memory leak in nl80211_probe_mesh_link cfg80211: fix memory leak in cfg80211_cqm_rssi_update cfg80211: fix page refcount issue in A-MSDU decap bpf/sockmap: Read psock ingress_msg before sk_receive_queue i2c: iop3xx: Fix memory leak in probe error path netfilter: fix a use-after-free in mtype_destroy() netfilter: arp_tables: init netns pointer in xt_tgdtor_param struct netfilter: nat: fix ICMP header corruption on ICMP errors netfilter: nft_tunnel: fix null-attribute check netfilter: nft_tunnel: ERSPAN_VERSION must not be null netfilter: nf_tables: remove WARN and add NLA_STRING upper limits netfilter: nf_tables: store transaction list locally while requesting module netfilter: nf_tables: fix flowtable list del corruption NFC: pn533: fix bulk-message timeout net: bpf: Don't leak time wait and request sockets bpftool: Fix printing incorrect pointer in btf_dump_ptr batman-adv: Fix DAT candidate selection on little endian systems macvlan: use skb_reset_mac_header() in macvlan_queue_xmit() hv_netvsc: Fix memory leak when removing rndis device net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key() net: dsa: tag_qca: fix doubled Tx statistics net: hns3: pad the short frame before sending to the hardware net: hns: fix soft lockup when there is not enough memory net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset net/sched: act_ife: initalize ife->metalist earlier net: usb: lan78xx: limit size of local TSO packets net/wan/fsl_ucc_hdlc: fix out of bounds write on array utdm_info ptp: free ptp device pin descriptors properly r8152: add missing endpoint sanity check tcp: fix marked lost packets not being retransmitted bnxt_en: Fix NTUPLE firmware command failures. bnxt_en: Fix ipv6 RFS filter matching logic. bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal. net: ethernet: ave: Avoid lockdep warning net: systemport: Fixed queue mapping in internal ring map net: dsa: sja1105: Don't error out on disabled ports with no phy-mode net: dsa: tag_gswip: fix typo in tagger name net: sched: act_ctinfo: fix memory leak net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec i40e: prevent memory leak in i40e_setup_macvlans drm/amdgpu: allow direct upload save restore list for raven2 sh_eth: check sh_eth_cpu_data::dual_port when dumping registers mlxsw: spectrum: Do not modify cloned SKBs during xmit mlxsw: spectrum: Wipe xstats.backlog of down ports mlxsw: spectrum_qdisc: Include MC TCs in Qdisc counters net: stmmac: selftests: Make it work in Synopsys AXS101 boards net: stmmac: selftests: Mark as fail when received VLAN ID != expected selftests: mlxsw: qos_mc_aware: Fix mausezahn invocation net: stmmac: selftests: Update status when disabling RSS net: stmmac: tc: Do not setup flower filtering if RSS is enabled devlink: Wait longer before warning about unset port type xen/blkfront: Adjust indentation in xlvbd_alloc_gendisk dt-bindings: Add missing 'properties' keyword enclosing 'snps,tso' tcp: refine rule to allow EPOLLOUT generation under mem pressure irqchip: Place CONFIG_SIFIVE_PLIC into the menu arm64: dts: qcom: msm8998: Disable coresight by default cw1200: Fix a signedness bug in cw1200_load_firmware() arm64: dts: meson: axg: fix audio fifo reg size arm64: dts: meson: g12: fix audio fifo reg size arm64: dts: meson-gxl-s905x-khadas-vim: fix gpio-keys-polled node arm64: dts: renesas: r8a77970: Fix PWM3 arm64: dts: marvell: Add AP806-dual missing CPU clocks cfg80211: check for set_wiphy_params tick/sched: Annotate lockless access to last_jiffies_update arm64: dts: marvell: Fix CP110 NAND controller node multi-line comment alignment arm64: dts: renesas: r8a774a1: Remove audio port node arm64: dts: imx8mm-evk: Assigned clocks for audio plls arm64: dts: qcom: sdm845-cheza: delete zap-shader ARM: dts: imx6ul-kontron-n6310-s: Disable the snvs-poweroff driver arm64: dts: allwinner: a64: Re-add PMU node ARM: dts: dra7: fix cpsw mdio fck clock arm64: dts: juno: Fix UART frequency ARM: dts: Fix sgx sysconfig register for omap4 Revert "arm64: dts: juno: add dma-ranges property" mtd: devices: fix mchp23k256 read and write mtd: cfi_cmdset_0002: only check errors when ready in cfi_check_err_status() mtd: cfi_cmdset_0002: fix delayed error detection on HyperFlash um: Don't trace irqflags during shutdown um: virtio_uml: Disallow modular build reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattr scsi: esas2r: unlock on error in esas2r_nvram_read_direct() scsi: hisi_sas: Don't create debugfs dump folder twice scsi: hisi_sas: Set the BIST init value before enabling BIST scsi: qla4xxx: fix double free bug scsi: bnx2i: fix potential use after free scsi: target: core: Fix a pr_debug() argument scsi: lpfc: fix: Coverity: lpfc_get_scsi_buf_s3(): Null pointer dereferences scsi: hisi_sas: Return directly if init hardware failed scsi: scsi_transport_sas: Fix memory leak when removing devices scsi: qla2xxx: Fix qla2x00_request_irqs() for MSI scsi: qla2xxx: fix rports not being mark as lost in sync fabric scan scsi: core: scsi_trace: Use get_unaligned_be*() scsi: lpfc: Fix list corruption detected in lpfc_put_sgl_per_hdwq scsi: lpfc: Fix hdwq sgl locks and irq handling scsi: lpfc: Fix a kernel warning triggered by lpfc_get_sgl_per_hdwq() rtw88: fix potential read outside array boundary perf probe: Fix wrong address verification perf script: Allow --time with --reltime clk: sprd: Use IS_ERR() to validate the return value of syscon_regmap_lookup_by_phandle() clk: imx7ulp: Correct system clock source option #7 clk: imx7ulp: Correct DDR clock mux options regulator: ab8500: Remove SYSCLKREQ from enum ab8505_regulator_id hwmon: (pmbus/ibm-cffps) Switch LEDs to blocking brightness call hwmon: (pmbus/ibm-cffps) Fix LED blink behavior perf script: Fix --reltime with --time scsi: lpfc: use hdwq assigned cpu for allocation Linux 5.4.14 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I400bdf3be682df698c2477fbf869d5ad8ce300b5	2020-01-23 08:37:43 +01:00
Vlastimil Babka	d30dce3510	mm, debug_pagealloc: don't rely on static keys too early commit `8e57f8acbb` upstream. Commit `96a2b03f28` ("mm, debug_pagelloc: use static keys to enable debugging") has introduced a static key to reduce overhead when debug_pagealloc is compiled in but not enabled. It relied on the assumption that jump_label_init() is called before parse_early_param() as in start_kernel(), so when the "debug_pagealloc=on" option is parsed, it is safe to enable the static key. However, it turns out multiple architectures call parse_early_param() earlier from their setup_arch(). x86 also calls jump_label_init() even earlier, so no issue was found while testing the commit, but same is not true for e.g. ppc64 and s390 where the kernel would not boot with debug_pagealloc=on as found by our QA. To fix this without tricky changes to init code of multiple architectures, this patch partially reverts the static key conversion from `96a2b03f28`. Init-time and non-fastpath calls (such as in arch code) of debug_pagealloc_enabled() will again test a simple bool variable. Fastpath mm code is converted to a new debug_pagealloc_enabled_static() variant that relies on the static key, which is enabled in a well-defined point in mm_init() where it's guaranteed that jump_label_init() has been called, regardless of architecture. [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early] Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz Fixes: `96a2b03f28` ("mm, debug_pagelloc: use static keys to enable debugging") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-23 08:22:40 +01:00
Sami Tolvanen	7f498a4b7b	FROMLIST: scs: add accounting This change adds accounting for the memory allocated for shadow stacks. Bug: 145210207 Change-Id: Iee94c22abefcabb63a3bcd4db8ba952130f30a82 (am from https://lore.kernel.org/patchwork/patch/1149055/) Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2019-11-27 12:49:09 -08:00
Greg Kroah-Hartman	682d8bf784	Merge tag 'v5.4-rc7' into android-mainline Linux 5.4-rc7 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I505207a0a6f68ccc3519d7f190d8faf25d9d479a	2019-11-11 06:10:55 +01:00
Johannes Weiner	1be334e5c0	mm/page_alloc.c: ratelimit allocation failure warnings more aggressively While investigating a bug related to higher atomic allocation failures, we noticed the failure warnings positively drowning the console, and in our case trigger lockup warnings because of a serial console too slow to handle all that output. But even if we had a faster console, it's unclear what additional information the current level of repetition provides. Allocation failures happen for three reasons: The machine is OOM, the VM is failing to handle reasonable requests, or somebody is making unreasonable requests (and didn't acknowledge their opportunism with __GFP_NOWARN). Having the memory dump, a callstack, and the ratelimit stats on skipped failure warnings should provide enough information to let users/admins/developers know whether something is wrong and point them in the right direction for debugging, bpftracing etc. Limit allocation failure warnings to one spew every ten seconds. Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-06 08:47:50 -08:00
Mel Gorman	3e8fc0075e	mm, meminit: recalculate pcpu batch and high limits after init completes Deferred memory initialisation updates zone->managed_pages during the initialisation phase but before that finishes, the per-cpu page allocator (pcpu) calculates the number of pages allocated/freed in batches as well as the maximum number of pages allowed on a per-cpu list. As zone->managed_pages is not up to date yet, the pcpu initialisation calculates inappropriately low batch and high values. This increases zone lock contention quite severely in some cases with the degree of severity depending on how many CPUs share a local zone and the size of the zone. A private report indicated that kernel build times were excessive with extremely high system CPU usage. A perf profile indicated that a large chunk of time was lost on zone->lock contention. This patch recalculates the pcpu batch and high values after deferred initialisation completes for every populated zone in the system. It was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation workload -- allmodconfig and all available CPUs. mmtests configuration: config-workload-kernbench-max Configuration was modified to build on a fresh XFS partition. kernbench 5.4.0-rc3 5.4.0-rc3 vanilla resetpcpu-v2 Amean user-256 13249.50 ( 0.00%) 16401.31 * -23.79%* Amean syst-256 14760.30 ( 0.00%) 4448.39 * 69.86%* Amean elsp-256 162.42 ( 0.00%) 119.13 * 26.65%* Stddev user-256 42.97 ( 0.00%) 19.15 ( 55.43%) Stddev syst-256 336.87 ( 0.00%) 6.71 ( 98.01%) Stddev elsp-256 2.46 ( 0.00%) 0.39 ( 84.03%) 5.4.0-rc3 5.4.0-rc3 vanilla resetpcpu-v2 Duration User 39766.24 49221.79 Duration System 44298.10 13361.67 Duration Elapsed 519.11 388.87 The patch reduces system CPU usage by 69.86% and total build time by 26.65%. The variance of system CPU usage is also much reduced. Before, this was the breakdown of batch and high values over all zones was: 256 batch: 1 256 batch: 63 512 batch: 7 256 high: 0 256 high: 378 512 high: 42 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the patch: 256 batch: 1 768 batch: 63 256 high: 0 768 high: 378 [mgorman@techsingularity.net: fix merge/linkage snafu] Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.netLink: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-06 08:28:58 -08:00
Greg Kroah-Hartman	9be46ff3b6	Merge 5.4-rc4 into android-mainline Linux 5.4-rc4 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I0edccd72fad8b6443b24c8c1005b66d6b8f532ce	2019-10-26 19:24:41 +02:00
Greg Kroah-Hartman	630839ac24	Merge 5.4-rc3 into android-mainline Linux 5.4-rc3 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ia87ba662738dd58ddb917e32c1fbd812861e7a46	2019-10-17 05:28:13 -07:00
David Rientjes	3f36d86694	mm, hugetlb: allow hugepage allocations to reclaim as needed Commit `b39d0ee263` ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") has chnaged the allocator to bail out from the allocator early to prevent from a potentially excessive memory reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation, reclaim and compaction loop as long as there is a reasonable chance to make forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at the INIT_COMPACT_PRIORITY compaction attempt gives this feedback. The most obvious affected subsystem is hugetlbfs which allocates huge pages based on an admin request (or via admin configured overcommit). I have done a simple test which tries to allocate half of the memory for hugetlb pages while the memory is full of a clean page cache. This is not an unusual situation because we try to cache as much of the memory as possible and sysctl/sysfs interface to allocate huge pages is there for flexibility to allocate hugetlb pages at any time. System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages after the memory is prefilled by a clean page cache: root@test1:~# cat hugetlb_test.sh set -x echo 0 > /proc/sys/vm/nr_hugepages echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10)) TS=$(date +%s) echo 256 > /proc/sys/vm/nr_hugepages cat /proc/sys/vm/nr_hugepages The results for 2 consecutive runs on clean 5.3 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s + date +%s + TS=1569905284 + echo 256 + cat /proc/sys/vm/nr_hugepages 256 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s + date +%s + TS=1569905311 + echo 256 + cat /proc/sys/vm/nr_hugepages 256 Now with `b39d0ee263` applied root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s + date +%s + TS=1569905516 + echo 256 + cat /proc/sys/vm/nr_hugepages 11 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s + date +%s + TS=1569905541 + echo 256 + cat /proc/sys/vm/nr_hugepages 12 The success rate went down by factor of 20! Although hugetlb allocation requests might fail and it is reasonable to expect them to under extremely fragmented memory or when the memory is under a heavy pressure but the above situation is not that case. Fix the regression by reverting back to the previous behavior for __GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for those requests. Mike said: : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after : boot where this may not be as much of an issue. However, I am aware of at : least three use cases where allocations are made after the system has been : up and running for quite some time: : : - DB reconfiguration. If sysctl/sysfs fails to get required number of : huge pages, system is rebooted to perform allocation after boot. : : - VM provisioning. If unable get required number of huge pages, fall : back to base pages. : : - An application that does not preallocate pool, but rather allocates : pages at fault time for optimal NUMA locality. : : In all cases, I would expect `b39d0ee263` to cause regressions and : noticable behavior changes. : : My quick/limited testing in : https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com : was insufficient. It was also mentioned that if something like : `b39d0ee263` went forward, I would like exemptions for __GFP_RETRY_MAYFAIL : requests as in this patch. [mhocko@suse.com: reworded changelog] Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org Fixes: `b39d0ee263` ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-10-14 15:04:01 -07:00
Qian Cai	234fdce892	mm/page_alloc.c: fix a crash in free_pages_prepare() On architectures like s390, arch_free_page() could mark the page unused (set_page_unused()) and any access later would trigger a kernel panic. Fix it by moving arch_free_page() after all possible accessing calls. Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f 000003d080012100 000003d080013fc0 0000000000000000 0000000000100000 00000000275cca48 0000000000000100 0000000000000008 000003d080010000 00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0 Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6 0000000026c2b962: e32014000036 pfd 2,1024(%r1) #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1) >0000000026c2b96e: 41101100 la %r1,256(%r1) 0000000026c2b972: a737fff8 brctg %r3,26c2b962 0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1) 0000000026c2b97c: e31003400004 lg %r1,832 0000000026c2b982: ebff1430016a asi 5168(%r1),-1 Call Trace: __free_pages_ok+0x16a/0x5d8) memblock_free_all+0x206/0x290 mem_init+0x58/0x120 start_kernel+0x2b0/0x570 startup_continue+0x6a/0xc0 INFO: lockdep is turned off. Last Breaking-Event-Address: __free_pages_ok+0x372/0x5d8 Kernel panic - not syncing: Fatal exception: panic_on_oops 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C In the past, only kernel_poison_pages() would trigger this but it needs "page_poison=on" kernel cmdline, and I suspect nobody tested that on s390. Recently, kernel_init_free_pages() (commit `6471384af2` ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")) was added and could trigger this as well. [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw Fixes: `8823b1dbc0` ("mm/page_poison.c: enable PAGE_POISONING as a separate option") Fixes: `6471384af2` ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: <stable@vger.kernel.org> [5.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-10-07 15:47:19 -07:00
Greg Kroah-Hartman	cb33d78781	Merge 5.4-rc1 into android-mainline Linux 5.4-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I15eec52df70f829acf81ff614a1c2a5fb443a4e0	2019-10-02 19:10:07 +02:00
Greg Kroah-Hartman	94139142d9	Merge 5.4-rc1-prelrease into android-mainline To make the 5.4-rc1 merge easier, merge at a prerelease point in time before the final release happens. Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: If613d657fd0abf9910c5bf3435a745f01b89765e	2019-10-02 17:58:47 +02:00
Linus Torvalds	edf445ad7c	Merge branch 'hugepage-fallbacks' (hugepatch patches from David Rientjes) Merge hugepage allocation updates from David Rientjes: "We (mostly Linus, Andrea, and myself) have been discussing offlist how to implement a sane default allocation strategy for hugepages on NUMA platforms. With these reverts in place, the page allocator will happily allocate a remote hugepage immediately rather than try to make a local hugepage available. This incurs a substantial performance degradation when memory compaction would have otherwise made a local hugepage available. This series reverts those reverts and attempts to propose a more sane default allocation strategy specifically for hugepages. Andrea acknowledges this is likely to fix the swap storms that he originally reported that resulted in the patches that removed __GFP_THISNODE from hugepage allocations. The immediate goal is to return 5.3 to the behavior the kernel has implemented over the past several years so that remote hugepages are not immediately allocated when local hugepages could have been made available because the increased access latency is untenable. The next goal is to introduce a sane default allocation strategy for hugepages allocations in general regardless of the configuration of the system so that we prevent thrashing of local memory when compaction is unlikely to succeed and can prefer remote hugepages over remote native pages when the local node is low on memory." Note on timing: this reverts the hugepage VM behavior changes that got introduced fairly late in the 5.3 cycle, and that fixed a huge performance regression for certain loads that had been around since 4.18. Andrea had this note: "The regression of 4.18 was that it was taking hours to start a VM where 3.10 was only taking a few seconds, I reported all the details on lkml when it was finally tracked down in August 2018. https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/ __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio workload degrade like in the "current upstream" above. And it still would have been that bad as above until 5.3-rc5" where the bad behavior ends up happening as you fill up a local node, and without that change, you'd get into the nasty swap storm behavior due to compaction working overtime to make room for more memory on the nodes. As a result 5.3 got the two performance fix reverts in rc5. However, David Rientjes then noted that those performance fixes in turn regressed performance for other loads - although not quite to the same degree. He suggested reverting the reverts and instead replacing them with two small changes to how hugepage allocations are done (patch descriptions rephrased by me): - "avoid expensive reclaim when compaction may not succeed": just admit that the allocation failed when you're trying to allocate a huge-page and compaction wasn't successful. - "allow hugepage fallback to remote nodes when madvised": when that node-local huge-page allocation failed, retry without forcing the local node. but by then I judged it too late to replace the fixes for a 5.3 release. So 5.3 was released with behavior that harked back to the pre-4.18 logic. But now we're in the merge window for 5.4, and we can see if this alternate model fixes not just the horrendous swap storm behavior, but also restores the performance regression that the late reverts caused. Fingers crossed. * emailed patches from David Rientjes <rientjes@google.com>: mm, page_alloc: allow hugepage fallback to remote nodes when madvised mm, page_alloc: avoid expensive reclaim when compaction may not succeed Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" Revert "Revert "mm, thp: restore node-local hugepage allocations""	2019-09-28 14:26:47 -07:00
David Rientjes	b39d0ee263	mm, page_alloc: avoid expensive reclaim when compaction may not succeed Memory compaction has a couple significant drawbacks as the allocation order increases, specifically: - isolate_freepages() is responsible for finding free pages to use as migration targets and is implemented as a linear scan of memory starting at the end of a zone, - failing order-0 watermark checks in memory compaction does not account for how far below the watermarks the zone actually is: to enable migration, there must be some free memory available. Per the above, watermarks are not always suffficient if isolate_freepages() cannot find the free memory but it could require hundreds of MBs of reclaim to even reach this threshold (read: potentially very expensive reclaim with no indication compaction can be successful), and - if compaction at this order has failed recently so that it does not even run as a result of deferred compaction, looping through reclaim can often be pointless. For hugepage allocations, these are quite substantial drawbacks because these are very high order allocations (order-9 on x86) and falling back to doing reclaim can potentially be very expensive without any indication that compaction would even be successful. Reclaim itself is unlikely to free entire pageblocks and certainly no reliance should be put on it to do so in isolation (recall lumpy reclaim). This means we should avoid reclaim and simply fail hugepage allocation if compaction is deferred. It is also not helpful to thrash a zone by doing excessive reclaim if compaction may not be able to access that memory. If order-0 watermarks fail and the allocation order is sufficiently large, it is likely better to fail the allocation rather than thrashing the zone. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-28 14:05:38 -07:00
Yang Shi	7ae88534cd	mm: move mem_cgroup_uncharge out of __page_cache_release() A later patch makes THP deferred split shrinker memcg aware, but it needs page->mem_cgroup information in THP destructor, which is called after mem_cgroup_uncharge() now. So move mem_cgroup_uncharge() from __page_cache_release() to compound page destructor, which is called by both THP and other compound pages except HugeTLB. And call it in __put_single_page() for single order page. Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:11 -07:00
Yang Shi	364c1eebe4	mm: thp: extract split_queue_* into a struct Patch series "Make deferred split shrinker memcg aware", v6. Currently THP deferred split shrinker is not memcg aware, this may cause premature OOM with some configuration. For example the below test would run into premature OOM easily: $ cgcreate -g memory:thp $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes $ cgexec -g memory:thp transhuge-stress 4000 transhuge-stress comes from kernel selftest. It is easy to hit OOM, but there are still a lot THP on the deferred split queue, memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware. Convert deferred split shrinker memcg aware by introducing per memcg deferred split queue. The THP should be on either per node or per memcg deferred split queue if it belongs to a memcg. When the page is immigrated to the other memcg, it will be immigrated to the target memcg's deferred split queue too. Reuse the second tail page's deferred_list for per memcg list since the same THP can't be on multiple deferred split queues. Make deferred split shrinker not depend on memcg kmem since it is not slab. It doesn't make sense to not shrink THP even though memcg kmem is disabled. With the above change the test demonstrated above doesn't trigger OOM even though with cgroup.memory=nokmem. This patch (of 4): Put split_queue, split_queue_lock and split_queue_len into a struct in order to reduce code duplication when we convert deferred_split to memcg aware in the later patches. Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:11 -07:00
Vlastimil Babka	4943308556	mm, compaction: raise compaction priority after it withdrawns Mike Kravetz reports that "hugetlb allocations could stall for minutes or hours when should_compact_retry() would return true more often then it should. Specifically, this was in the case where compact_result was COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being made." The problem is that the compaction_withdrawn() test in should_compact_retry() includes compaction outcomes that are only possible on low compaction priority, and results in a retry without increasing the priority. This may result in furter reclaim, and more incomplete compaction attempts. With this patch, compaction priority is raised when possible, or should_compact_retry() returns false. The COMPACT_SKIPPED result doesn't really fit together with the other outcomes in compaction_withdrawn(), as that's a result caused by insufficient order-0 pages, not due to low compaction priority. With this patch, it is moved to a new compaction_needs_reclaim() function, and for that outcome we keep the current logic of retrying if it looks like reclaim will be able to help. Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com Reported-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:10 -07:00
Matthew Wilcox (Oracle)	d8c6546b1a	mm: introduce compound_nr() Replace 1 << compound_order(page) with compound_nr(page). Minor improvements in readability. Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:08 -07:00
Greg Kroah-Hartman	896be8f44d	Merge 5.4-rc1-prereleae into android-mainline To make the 5.4-rc1 merge easier, merge at a prerelease point in time before the final release happens. Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I29b683c837ed1a3324644dbf9bf863f30740cd0b	2019-09-23 14:14:08 +02:00
Linus Torvalds	84da111de0	Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull hmm updates from Jason Gunthorpe: "This is more cleanup and consolidation of the hmm APIs and the very strongly related mmu_notifier interfaces. Many places across the tree using these interfaces are touched in the process. Beyond that a cleanup to the page walker API and a few memremap related changes round out the series: - General improvement of hmm_range_fault() and related APIs, more documentation, bug fixes from testing, API simplification & consolidation, and unused API removal - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and make them internal kconfig selects - Hoist a lot of code related to mmu notifier attachment out of drivers by using a refcount get/put attachment idiom and remove the convoluted mmu_notifier_unregister_no_release() and related APIs. - General API improvement for the migrate_vma API and revision of its only user in nouveau - Annotate mmu_notifiers with lockdep and sleeping region debugging Two series unrelated to HMM or mmu_notifiers came along due to dependencies: - Allow pagemap's memremap_pages family of APIs to work without providing a struct device - Make walk_page_range() and related use a constant structure for function pointers" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits) libnvdimm: Enable unit test infrastructure compile checks mm, notifier: Catch sleeping/blocking for !blockable kernel.h: Add non_block_start/end() drm/radeon: guard against calling an unpaired radeon_mn_unregister() csky: add missing brackets in a macro for tlb.h pagewalk: use lockdep_assert_held for locking validation pagewalk: separate function pointers from iterator data mm: split out a new pagewalk.h header from mm.h mm/mmu_notifiers: annotate with might_sleep() mm/mmu_notifiers: prime lockdep mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports mm/hmm: hmm_range_fault() infinite loop mm/hmm: hmm_range_fault() NULL pointer bug mm/hmm: fix hmm_range_fault()'s handling of swapped out pages mm/mmu_notifiers: remove unregister_no_release RDMA/odp: remove ib_ucontext from ib_umem RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm' RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr RDMA/mlx5: Use ib_umem_start instead of umem.address ...	2019-09-21 10:07:42 -07:00
Greg Kroah-Hartman	bfa0399bc8	Merge Linus's 5.4-rc1-prerelease branch into android-mainline This merges Linus's tree as of commit `b41dae061b` ("Merge tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux") into android-mainline. This "early" merge makes it easier to test and handle merge conflicts instead of having to wait until the "end" of the merge window and handle all 10000+ commits at once. Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I6bebf55e5e2353f814e3c87f5033607b1ae5d812	2019-09-20 16:07:54 -07:00
Linus Torvalds	7e67a85999	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann, Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers. As perf and the scheduler is getting bigger and more complex, document the status quo of current responsibilities and interests, and spread the review pain^H^H^H^H fun via an increase in the Cc: linecount generated by scripts/get_maintainer.pl. :-) - Add another series of patches that brings the -rt (PREEMPT_RT) tree closer to mainline: split the monolithic CONFIG_PREEMPT dependencies into a new CONFIG_PREEMPTION category that will allow the eventual introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches to go though. - Extend the CPU cgroup controller with uclamp.min and uclamp.max to allow the finer shaping of CPU bandwidth usage. - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS). - Improve the behavior of high CPU count, high thread count applications running under cpu.cfs_quota_us constraints. - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present. - Improve CPU isolation housekeeping CPU allocation NUMA locality. - Fix deadline scheduler bandwidth calculations and logic when cpusets rebuilds the topology, or when it gets deadline-throttled while it's being offlined. - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from setscheduler() system calls without creating global serialization. Add new synchronization between cpuset topology-changing events and the deadline acceptance tests in setscheduler(), which were broken before. - Rework the active_mm state machine to be less confusing and more optimal. - Rework (simplify) the pick_next_task() slowpath. - Improve load-balancing on AMD EPYC systems. - ... and misc cleanups, smaller fixes and improvements - please see the Git log for more details. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) sched/psi: Correct overly pessimistic size calculation sched/fair: Speed-up energy-aware wake-ups sched/uclamp: Always use 'enum uclamp_id' for clamp_id values sched/uclamp: Update CPU's refcount on TG's clamp changes sched/uclamp: Use TG's clamps to restrict TASK's clamps sched/uclamp: Propagate system defaults to the root group sched/uclamp: Propagate parent clamps sched/uclamp: Extend CPU's cgroup controller sched/topology: Improve load balancing on AMD EPYC systems arch, ia64: Make NUMA select SMP sched, perf: MAINTAINERS update, add submaintainers and reviewers sched/fair: Use rq_lock/unlock in online_fair_sched_group cpufreq: schedutil: fix equation in comment sched: Rework pick_next_task() slow-path sched: Allow put_prev_task() to drop rq->lock sched/fair: Expose newidle_balance() sched: Add task_struct pointer to sched_class::set_curr_task sched: Rework CPU hotplug task selection sched/{rt,deadline}: Fix set_next_task vs pick_next_task sched: Fix kerneldoc comment for ia64_set_curr_task ...	2019-09-16 17:25:49 -07:00
Matt Fleming	a55c7454a8	sched/topology: Improve load balancing on AMD EPYC systems SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init() for any sched domains with a NUMA distance greater than 2 hops (RECLAIM_DISTANCE). The idea being that it's expensive to balance across domains that far apart. However, as is rather unfortunately explained in: commit `32e45ff43e` ("mm: increase RECLAIM_DISTANCE to 30") the value for RECLAIM_DISTANCE is based on node distance tables from 2011-era hardware. Current AMD EPYC machines have the following NUMA node distances: node distances: node 0 1 2 3 4 5 6 7 0: 10 16 16 16 32 32 32 32 1: 16 10 16 16 32 32 32 32 2: 16 16 10 16 32 32 32 32 3: 16 16 16 10 32 32 32 32 4: 32 32 32 32 10 16 16 16 5: 32 32 32 32 16 10 16 16 6: 32 32 32 32 16 16 10 16 7: 32 32 32 32 16 16 16 10 where 2 hops is 32. The result is that the scheduler fails to load balance properly across NUMA nodes on different sockets -- 2 hops apart. For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4 (CPUs 32-39) like so, $ numactl -C 0-7,32-39 ./spinner 16 causes all threads to fork and remain on node 0 until the active balancer kicks in after a few seconds and forcibly moves some threads to node 4. Override node_reclaim_distance for AMD Zen. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suravee.Suthikulpanit@amd.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas.Lendacky@amd.com Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>	2019-09-03 09:17:37 +02:00
Greg Kroah-Hartman	ad455d87e5	Merge 5.3-rc6 into android-mainline Linux 5.3-rc6 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Id10580d48d56054408b3efe0bd1866d67aba2a3d	2019-08-26 16:45:30 +02:00
David Rientjes	cd96103838	mm, page_alloc: move_freepages should not examine struct page of reserved memory After commit `907ec5fca3` ("mm: zero remaining unavailable struct pages"), struct page of reserved memory is zeroed. This causes page->flags to be 0 and fixes issues related to reading /proc/kpageflags, for example, of reserved memory. The VM_BUG_ON() in move_freepages_block(), however, assumes that page_zone() is meaningful even for reserved memory. That assumption is no longer true after the aforementioned commit. There's no reason why move_freepages_block() should be testing the legitimacy of page_zone() for reserved memory; its scope is limited only to pages on the zone's freelist. Note that pfn_valid() can be true for reserved memory: there is a backing struct page. The check for page_to_nid(page) is also buggy but reserved memory normally only appears on node 0 so the zeroing doesn't affect this. Move the debug checks to after verifying PageBuddy is true. This isolates the scope of the checks to only be for buddy pages which are on the zone's freelist which move_freepages_block() is operating on. In this case, an incorrect node or zone is a bug worthy of being warned about (and the examination of struct page is acceptable bcause this memory is not reserved). Why does move_freepages_block() gets called on reserved memory? It's simply math after finding a valid free page from the per-zone free area to use as fallback. We find the beginning and end of the pageblock of the valid page and that can bring us into memory that was reserved per the e820. pfn_valid() is still true (it's backed by a struct page), but since it's zero'd we shouldn't make any inferences here about comparing its node or zone. The current node check just happens to succeed most of the time by luck because reserved memory typically appears on node 0. The fix here is to validate that we actually have buddy pages before testing if there's any type of zone or node strangeness going on. We noticed it almost immediately after bringing `907ec5fca3` in on CONFIG_DEBUG_VM builds. It depends on finding specific free pages in the per-zone free area where the math in move_freepages() will bring the start or end pfn into reserved memory and wanting to claim that entire pageblock as a new migratetype. So the path will be rare, require CONFIG_DEBUG_VM, and require fallback to a different migratetype. Some struct pages were already zeroed from reserve pages before 907ec5fca3c so it theoretically could trigger before this commit. I think it's rare enough under a config option that most people don't run that others may not have noticed. I wouldn't argue against a stable tag and the backport should be easy enough, but probably wouldn't single out a commit that this is fixing. Mel said: : The overhead of the debugging check is higher with this patch although : it'll only affect debug builds and the path is not particularly hot. : If this was a concern, I think it would be reasonable to simply remove : the debugging check as the zone boundaries are checked in : move_freepages_block and we never expect a zone/node to be smaller than : a pageblock and stuck in the middle of another zone. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-08-24 19:48:42 -07:00
Christoph Hellwig	fdc029b19d	memremap: remove the dev field in struct dev_pagemap The dev field in struct dev_pagemap is only used to print dev_name in two places, which are at best nice to have. Just remove the field and thus the name in those two messages. Link: https://lore.kernel.org/r/20190818090557.17853-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>	2019-08-20 09:41:35 -03:00
Greg Kroah-Hartman	37766c2946	Merge 5.3.0-rc1 into android-mainline Linus 5.3-rc1 release Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ic171e37d4c21ffa495240c5538852bbb5a9dcce8	2019-07-23 16:21:59 -07:00
Dan Williams	ba72b4c8cf	mm/sparsemem: support sub-section hotplug The libnvdimm sub-system has suffered a series of hacks and broken workarounds for the memory-hotplug implementation's awkward section-aligned (128MB) granularity. For example the following backtrace is emitted when attempting arch_add_memory() with physical address ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section: # cat /proc/iomem \| grep -A1 -B1 Persistent\ Memory 100000000-1ffffffff : System RAM 200000000-303ffffff : Persistent Memory (legacy) 304000000-43fffffff : System RAM 440000000-23ffffffff : Persistent Memory 2400000000-43bfffffff : Persistent Memory 2400000000-43bfffffff : namespace2.0 WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60 [..] RIP: 0010:add_pages+0x5c/0x60 [..] Call Trace: devm_memremap_pages+0x460/0x6e0 pmem_attach_disk+0x29e/0x680 [nd_pmem] ? nd_dax_probe+0xfc/0x120 [libnvdimm] nvdimm_bus_probe+0x66/0x160 [libnvdimm] It was discovered that the problem goes beyond RAM vs PMEM collisions as some platform produce PMEM vs PMEM collisions within a given section. The libnvdimm workaround for that case revealed that the libnvdimm section-alignment-padding implementation has been broken for a long while. A fix for that long-standing breakage introduces as many problems as it solves as it would require a backward-incompatible change to the namespace metadata interpretation. Instead of that dubious route [1], address the root problem in the memory-hotplug implementation. Note that EEXIST is no longer treated as success as that is how sparse_add_section() reports subsection collisions, it was also obviated by recent changes to perform the request_region() for 'System RAM' before arch_add_memory() in the add_memory() sequence. [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com [osalvador@suse.de: fix deactivate_section for early sections] Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-07-18 17:08:07 -07:00
Dan Williams	46d945aeab	mm: kill is_dev_zone() helper Given there are no more usages of is_dev_zone() outside of 'ifdef CONFIG_ZONE_DEVICE' protection, kill off the compilation helper. Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Cc: Michal Hocko <mhocko@suse.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-07-18 17:08:07 -07:00
Dan Williams	f46edbd1b1	mm/sparsemem: add helpers track active portions of a section at boot Prepare for hot{plug,remove} of sub-ranges of a section by tracking a sub-section active bitmask, each bit representing a PMD_SIZE span of the architecture's memory hotplug section size. The implications of a partially populated section is that pfn_valid() needs to go beyond a valid_section() check and either determine that the section is an "early section", or read the sub-section active ranges from the bitmask. The expectation is that the bitmask (subsection_map) fits in the same cacheline as the valid_section() / early_section() data, so the incremental performance overhead to pfn_valid() should be negligible. The rationale for using early_section() to short-ciruit the subsection_map check is that there are legacy code paths that use pfn_valid() at section granularity before validating the pfn against pgdat data. So, the early_section() check allows those traditional assumptions to persist while also permitting subsection_map to tell the truth for purposes of populating the unused portions of early sections with PMEM and other ZONE_DEVICE mappings. Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Qian Cai <cai@lca.pw> Tested-by: Jane Chu <jane.chu@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-07-18 17:08:07 -07:00
Dan Williams	f1eca35a0d	mm/sparsemem: introduce struct mem_section_usage Patch series "mm: Sub-section memory hotplug support", v10. The memory hotplug section is an arbitrary / convenient unit for memory hotplug. 'Section-size' units have bled into the user interface ('memblock' sysfs) and can not be changed without breaking existing userspace. The section-size constraint, while mostly benign for typical memory hotplug, has and continues to wreak havoc with 'device-memory' use cases, persistent memory (pmem) in particular. Recall that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a 'struct page' memmap for pmem. However, it does not use the 'bottom half' of memory hotplug, i.e. never marks pmem pages online and never exposes the userspace memblock interface for pmem. This leaves an opening to redress the section-size constraint. To date, the libnvdimm subsystem has attempted to inject padding to satisfy the internal constraints of arch_add_memory(). Beyond complicating the code, leading to bugs [2], wasting memory, and limiting configuration flexibility, the padding hack is broken when the platform changes this physical memory alignment of pmem from one boot to the next. Device failure (intermittent or permanent) and physical reconfiguration are events that can cause the platform firmware to change the physical placement of pmem on a subsequent boot, and device failure is an everyday event in a data-center. It turns out that sections are only a hard requirement of the user-facing interface for memory hotplug and with a bit more infrastructure sub-section arch_add_memory() support can be added for kernel internal usages like devm_memremap_pages(). Here is an analysis of the current design assumptions in the current code and how they are addressed in the new implementation: Current design assumptions: - Sections that describe boot memory (early sections) are never unplugged / removed. - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a valid_section() check - __add_pages() and helper routines assume all operations occur in PAGES_PER_SECTION units. - The memblock sysfs interface only comprehends full sections New design assumptions: - Sections are instrumented with a sub-section bitmask to track (on x86) individual 2MB sub-divisions of a 128MB section. - Partially populated early sections can be extended with additional sub-sections, and those sub-sections can be removed with arch_remove_memory(). With this in place we no longer lose usable memory capacity to padding. - pfn_valid() is updated to look deeper than valid_section() to also check the active-sub-section mask. This indication is in the same cacheline as the valid_section() so the performance impact is expected to be negligible. So far the lkp robot has not reported any regressions. - Outside of the core vmemmap population routines which are replaced, other helper routines like shrink_{zone,pgdat}_span() are updated to handle the smaller granularity. Core memory hotplug routines that deal with online memory are not touched. - The existing memblock sysfs user api guarantees / assumptions are not touched since this capability is limited to !online !memblock-sysfs-accessible sections. Meanwhile the issue reports continue to roll in from users that do not understand when and how the 128MB constraint will bite them. The current implementation relied on being able to support at least one misaligned namespace, but that immediately falls over on any moderately complex namespace creation attempt. Beyond the initial problem of 'System RAM' colliding with pmem, and the unsolvable problem of physical alignment changes, Linux is now being exposed to platforms that collide pmem ranges with other pmem ranges by default [3]. In short, devm_memremap_pages() has pushed the venerable section-size constraint past the breaking point, and the simplicity of section-aligned arch_add_memory() is no longer tenable. These patches are exposed to the kbuild robot on a subsection-v10 branch [4], and a preview of the unit test for this functionality is available on the 'subsection-pending' branch of ndctl [5]. [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com [3]: https://github.com/pmem/ndctl/issues/76 [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10 [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c This patch (of 13): Towards enabling memory hotplug to track partial population of a section, introduce 'struct mem_section_usage'. A pointer to a 'struct mem_section_usage' instance replaces the existing pointer to a 'pageblock_flags' bitmap. Effectively it adds one more 'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house a new 'subsection_map' bitmap. The new bitmap enables the memory hot{plug,remove} implementation to act on incremental sub-divisions of a section. SUBSECTION_SHIFT is defined as global constant instead of per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of subsection users. Specifically a common subsection size allows for the possibility that persistent memory namespace configurations be made compatible across architectures. The primary motivation for this functionality is to support platforms that mix "System RAM" and "Persistent Memory" within a single section, or multiple PMEM ranges with different mapping lifetimes within a single section. The section restriction for hotplug has caused an ongoing saga of hacks and bugs for devm_memremap_pages() users. Beyond the fixups to teach existing paths how to retrieve the 'usemap' from a section, and updates to usemap allocation path, there are no expected behavior changes. Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Qian Cai <cai@lca.pw> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-07-18 17:08:07 -07:00
Yafang Shao	e5ca8071fe	mm/vmscan.c: add a new member reclaim_state in struct shrink_control Patch series "mm/vmscan: calculate reclaimed slab in all reclaim paths". This patchset is to fix the issues in doing shrink slab. There're six different reclaim paths by now, - kswapd reclaim path - node reclaim path - hibernate preallocate memory reclaim path - direct reclaim path - memcg reclaim path - memcg softlimit reclaim path The slab caches reclaimed in these paths are only calculated in the above three paths. The issues are detailed explained in patch #2. We should calculate the reclaimed slab caches in every reclaim path. In order to do it, the struct reclaim_state is placed into the struct shrink_control. In node reclaim path, there'is another issue about shrinking slab, which is adressed in "mm/vmscan: shrink slab in node reclaim" (https://lore.kernel.org/linux-mm/1559874946-22960-1-git-send-email-laoar.shao@gmail.com/). This patch (of 2): The struct reclaim_state is used to record how many slab caches are reclaimed in one reclaim path. The struct shrink_control is used to control one reclaim path. So we'd better put reclaim_state into shrink_control. [laoar.shao@gmail.com: remove reclaim_state assignment from __perform_reclaim()] Link: http://lkml.kernel.org/r/1561381582-13697-1-git-send-email-laoar.shao@gmail.com Link: http://lkml.kernel.org/r/1561112086-6169-2-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-07-16 19:23:21 -07:00
Linus Torvalds	fec88ab0af	Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull HMM updates from Jason Gunthorpe: "Improvements and bug fixes for the hmm interface in the kernel: - Improve clarity, locking and APIs related to the 'hmm mirror' feature merged last cycle. In linux-next we now see AMDGPU and nouveau to be using this API. - Remove old or transitional hmm APIs. These are hold overs from the past with no users, or APIs that existed only to manage cross tree conflicts. There are still a few more of these cleanups that didn't make the merge window cut off. - Improve some core mm APIs: - export alloc_pages_vma() for driver use - refactor into devm_request_free_mem_region() to manage DEVICE_PRIVATE resource reservations - refactor duplicative driver code into the core dev_pagemap struct - Remove hmm wrappers of improved core mm APIs, instead have drivers use the simplified API directly - Remove DEVICE_PUBLIC - Simplify the kconfig flow for the hmm users and core code" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits) mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR mm: remove the HMM config option mm: sort out the DEVICE_PRIVATE Kconfig mess mm: simplify ZONE_DEVICE page private data mm: remove hmm_devmem_add mm: remove hmm_vma_alloc_locked_page nouveau: use devm_memremap_pages directly nouveau: use alloc_page_vma directly PCI/P2PDMA: use the dev_pagemap internal refcount device-dax: use the dev_pagemap internal refcount memremap: provide an optional internal refcount in struct dev_pagemap memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag memremap: remove the data field in struct dev_pagemap memremap: add a migrate_to_ram method to struct dev_pagemap_ops memremap: lift the devmap_enable manipulation into devm_memremap_pages memremap: pass a struct dev_pagemap to ->kill and ->cleanup memremap: move dev_pagemap callbacks into a separate structure memremap: validate the pagemap type passed to devm_memremap_pages mm: factor out a devm_request_free_mem_region helper mm: export alloc_pages_vma ...	2019-07-14 19:42:11 -07:00

1 2 3 4 5 ...

1556 Commits