
Commit b102bc4

Authored by idontkonwher, risemeup1, SigureMo, gouzil, and wwbitejotunn
【MetaX】Merge Metax's modifications to mxmaca/2.6 branch (#68534)
* fix windows bug for common lib (#60308)
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* fix windows bug
* Update inference_lib.cmake
* [Dy2St] Disable `test_bert` on CPU (#60173) (#60324)
  Co-authored-by: gouzil <[email protected]>
* [Cherry-pick] fix weight quant kernel bug when n div 64 != 0 (#60184)
* fix weight-only quant kernel error for n div 64 !=0
* code style fix
* tile (#60261)
* add chunk allocator posix_memalign return value check (#60208) (#60495)
* fix chunk allocator posix_memalign return value check;test=develop
* fix chunk allocator posix_memalign return value check;test=develop
* fix chunk allocator posix_memalign return value check;test=develop
* update 2023 security advisory, test=document_fix (#60532)
* fix fleetutil get_online_pass_interval bug2; test=develop (#60545)
* fix fused_rope diff (#60217) (#60593)
* [cherry-pick]fix fleetutil get_online_pass_interval bug3 (#60620)
* fix fleetutil get_online_pass_interval bug3; test=develop
* fix fleetutil get_online_pass_interval bug3; test=develop
* fix fleetutil get_online_pass_interval bug3; test=develop
* [cherry-pick]update pdsa-2023-019 (#60649)
* update 2023 security advisory, test=document_fix
* update pdsa-2023-019, test=document_fix
* [Dy2St][2.6] Disable `test_grad` on release/2.6 (#60662)
* fix bug of ci (#59926) (#60785)
* [Dy2St][2.6] Disable `test_transformer` on `release/2.6` and update README (#60786)
* [Dy2St][2.6] Disable `test_transformer` on release/2.6 and update README
* [Docs] Update latest release version in README (#60691)
* restore order
* [Dy2St][2.6] Increase `test_transformer` and `test_mobile_net` ut time (#60829) (#60875)
* [Cherry-pick] fix set_value with scalar grad (#60930)
* Fix set value grad (#59034)
* first fix the UT
* fix set value grad
* polish code
* add static mode backward test
* always has input valuetensor
* add dygraph test
* Fix shape error in combined-indexing setitem (#60447)
* add ut
* fix shape error in combine-indexing
* fix ut
* Set value with scalar (#60452)
* set_value with scalar
* fix ut
* remove test_pir
* remove one test since 2.6 not support uint8-add
* [cherry-pick] This PR enable offset of generator for custom device. (#60616) (#60772)
* fix core dump when fallback gather_nd_grad and MemoryAllocateHost (#61067)
* fix qat tests (#61211) (#61284)
* [Security] fix draw security problem (#61161) (#61338)
* fix draw security problem
* fix _decompress security problem (#61294) (#61337)
* Fix CVE-2024-0521 (#61032) (#61287)
  This uses shlex for safe command parsing to fix arbitrary code injection
  Co-authored-by: ndren <[email protected]>
* [Security] fix security problem for prune_by_memory_estimation (#61382)
* OS Command Injection prune_by_memory_estimation fix
* Fix StyleCode
* [Security] fix security problem for run_cmd (#61285) (#61398)
* fix security problem for run_cmd
* [Security] fix download security problem (#61162) (#61388)
* fix download security problem
* check eval for security (#61389)
* [cherry-pick] adapt c_embedding to phi namespace for custom devices (#60774) (#61045)
  Co-authored-by: Tian <[email protected]>
* [CherryPick] Fix issue 60092 (#61427)
* fix issue 60092
* update
* update
* update
* Fix unique (#60840) (#61044)
* fix unique kernel, row to num_out
* cinn(py-dsl): skip eval string in python-dsl (#61380) (#61586)
* remove _wget (#61356) (#61569)
* remove _wget
* remove _wget
* remove wget test
* fix layer_norm decompose dtyte bugs, polish codes (#61631)
* fix doc style (#61688)
* merge (#61866)
* [security] refine _get_program_cache_key (#61827) (#61896)
* security, refine _get_program_cache_key
* repeat_interleave support bf16 dtype (#61854) (#61899)
* repeat_interleave support bf16 dtype
* support bf16 on cpu
* Support Fake GroupWise Quant (#61900)
* fix launch when elastic run (#61847) (#61878)
* [Paddle-TRT] fix solve (#61806)
* [Cherry-Pick] Fix CacheKV Quant Bug (#61966)
* fix cachekv quant problem
* add unittest
* Sychronized the paddle2.4 adaptation changes
* clear third_part dependencies
* change submodules to right commits
* build pass with cpu only
* build success with maca
* build success with cutlass and fused kernels
* build with flash_attn and mccl
* build with test, fix some bugs
* fix some bugs
* fixed some compilation bugs
* fix bug in previous commit
* fix bug with split when col_size biger than 256
* add row_limit to show full kernel name
* add env.sh
  Change-Id: I6fded2761a44af952a4599691e19a1976bd9b9d1
* add shape record
  Change-Id: I273f5a5e97e2a31c1c8987ee1c3ce44a6acd6738
* modify paddle version
  Change-Id: I97384323c38066e22562a6fe8f44b245cbd68f98
* wuzhao optimized the performance of elementwise kernel.
  Change-Id: I607bc990415ab5ff7fb3337f628b3ac765d3186c
* fix split when dtype is fp16
  Change-Id: Ia55d31d11e6fa214d555326a553eaee3e928e597
* fix bug in previous commit
  Change-Id: I0fa66120160374da5a774ef2c04f133a54517069
* adapt flash_attn new capi
  Change-Id: Ic669be18daee9cecbc8542a14e02cdc4b8d429ba
* change eigen path
  Change-Id: I514c0028e16d19a3084656cc9aa0838a115fc75c
* modify mcname -> replaced_name
  Change-Id: Idc520d2db200ed5aa32da9573b19483d81a0fe9e
* fix some build bugs
  Change-Id: I50067dfa3fcaa019b5736f4426df6d4e5f64107d
* add PADDLE_ENABLE_SAME_RAND_A100
  Change-Id: I2d4ab6ed0b5fac3568562860b0ba1c4f8e346c61
  done
* remove redundant warning, add patch from 2.6.1
  Change-Id: I958d5bebdc68eb42fe433c76a3737330e00a72aa
* improve VectorizedBroadcastKernel
  (cherry picked from commit 19069b26c0bf05a80cc834162db072f6b8aa2536)
  Change-Id: Iaf5719d72ab52adbedc40d4788c52eb1ce4d517c
  Signed-off-by: m00891 <[email protected]>
* fix bugs
  (cherry picked from commit b007853a75dbd5de63028f4af82c15a5d3d81f7c)
  Change-Id: Iaec0418c384ad2c81c354ef09d81f3e9dfcf82f1
  Signed-off-by: m00891 <[email protected]>
* split ElementwiseDivGrad
  (cherry picked from commit eb6470406b7d440c135a3f7ff68fbed9494e9c1f)
  Change-Id: I60e8912be8f8d40ca83a54af1493adfa2962b2d6
  Signed-off-by: m00891 <[email protected]>
* in VectorizedElementwiseKernel, it can now use vecSize = 8
  (cherry picked from commit a873000a6c3bc9e2540e178d460e74e15a3d4de5)
  Change-Id: Ia703b1e9e959558988fcd09182387da839d33922
  Signed-off-by: m00891 <[email protected]>
* improve ModulatedDeformableCol2imCoordGpuKernel: 1.block size 512->64; 2.FastDivMod; 3.fix VL1; 4.remove DmcnGetCoordinateWeight divergent branches.
  (cherry picked from commit 82c914bdd29f0eef87a52b229ff84bc456a1beeb)
  Change-Id: I60b1fa9a9c89ade25e6b057c38e08616a24fa5e3
  Signed-off-by: m00891 <[email protected]>
* Optimize depthwise_conv2d_grad compute (InputGrad): 1.use shared memory to optimize data load from global memory; 2.different blocksize for different input shape; 3.FastDivMod for input shape div, >> and & for stride div.
  (cherry picked from commit b34a5634d848f3799f5a8bcf884731dba72d3b20)
  Change-Id: I0d8f22f2a2b9d99dc9fbfc1fb69b7bed66010229
  Signed-off-by: m00891 <[email protected]>
* improve VectorizedBroadcastKernel with LoadType = 2(kMixed)
  (cherry picked from commit 728b9547f65e096b45f39f096783d2bb49e8556f)
  Change-Id: I282dd8284a7cde54061780a22b397133303f51e5
  Signed-off-by: m00891 <[email protected]>
* fix ElementwiseDivGrad
  (cherry picked from commit 5f99c31904e94fd073bdd1696c3431cccaa376cb)
  Change-Id: I3ae0d6c01eec124d12fa226a002b10d0c40f820c
  Signed-off-by: m00891 <[email protected]>
* Revert "Optimize depthwise_conv2d_grad compute (InputGrad):"
  This reverts commit b34a5634d848f3799f5a8bcf884731dba72d3b20.
  (cherry picked from commit 398f5cde81e2131ff7014edfe1d7beaaf806adbb)
  Change-Id: I637685b91860a7dea6df6cbba0ff2cf31363e766
  Signed-off-by: m00891 <[email protected]>
* improve ElementwiseDivGrad and ElementwiseMulGrad
  (cherry picked from commit fe32db418d8f075e083f31dca7010398636a6e67)
  Change-Id: I4f7e0f2b5afd4e704ffcd7258def63afc43eea9c
  Signed-off-by: m00891 <[email protected]>
* improve FilterBBoxes
  (cherry picked from commit fe4655e86b92f5053fa886af49bf199307960a05)
  Change-Id: I35003420292359f8a41b19b7ca2cbaae17dc5b45
  Signed-off-by: m00891 <[email protected]>
* improve deformable_conv_grad op: 1.adaptive block size; 2.FastDivMod; 3.move ldg up.
  (cherry picked from commit a7cb0ed275a3488f79445ef31456ab6560e9de43)
  Change-Id: Ia89df4e5a26de64baae4152837d2ce3076c56df1
  Signed-off-by: m00891 <[email protected]>
* improve ModulatedDeformableIm2colGpuKernel: 1.adaptive block size; 2.FastDivMod; 3.move ldg up.
  (cherry picked from commit 4fb857655d09f55783d9445b91a2d953ed14d0b8)
  Change-Id: I7df7f3af7b4615e5e96d33b439e5276be6ddb732
  Signed-off-by: m00891 <[email protected]>
* improve KeBNBackwardData: replace 1.0/sqrt with rsqrt
  (cherry picked from commit 333cba7aca1edf7a0e87623a0e55e230cd1e9451)
  Change-Id: Ic808d42003677ed543621eb22a797f0ab7751baa
  Signed-off-by: m00891 <[email protected]>
* Improve KeBNBackwardData, FilterGradAddupGpuKernel kernels. Improve nonzero and masked_select (forward only) OP.
  (cherry picked from commit c907b40eb3f9ded6ee751e522c2a97a353ac93bd)
  Change-Id: I7f4845405e64e7599134a8c497f464ac04dead88
  Signed-off-by: m00891 <[email protected]>
* Optimize depthwise_conv2d: 1. 256 Blocksize launch for small shape inputgrad; 2. FastDivMod in inputgrad and filtergrad; 3. shared memory to put output_grad_data in small shape.
  (cherry picked from commit f9f29bf7b8d929fb95eb1153a79d8a6b96d5b6d2)
  Change-Id: I1a3818201784031dbedc320286ea5f4802dbb6b1
  Signed-off-by: m00891 <[email protected]>
* Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors.
  (cherry picked from commit 3bd200f262271a333b3947326442b86af7fb6da1)
  Change-Id: I57c94cc5e709be8926e1b21da14b653cb18eabc3
  Signed-off-by: m00891 <[email protected]>
* Revert "Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors."
  This reverts commit 3bd200f262271a333b3947326442b86af7fb6da1.
  (cherry picked from commit 86ed8adaa8c20d3c824eecb0ee1e10d365bcea37)
  Change-Id: I5b8b7819fdf99255c65fe832d5d77f8e439bdecb
  Signed-off-by: m00891 <[email protected]>
* improve ScatterInitCUDAKernel and ScatterCUDAKernel
  (cherry picked from commit cddb01a83411c45f68363248291c0c4685e60b24)
  Change-Id: Ie106ff8d65c21a8545c40636f021b73f3ad84587
  Signed-off-by: m00891 <[email protected]>
* fix bugs and make the code easier to read
  (cherry picked from commit 07ea3acf347fda434959c8c9cc3533c0686d1836)
  Change-Id: Id7a727fd18fac4a662f8af1bf6c6b5ebc6233c9f
  Signed-off-by: m00891 <[email protected]>
* Optimize FilterGard and InputGradSpL
  Use tmp to store ldg data in the loop so calculate and ldg time can fold each other.
  (cherry picked from commit 7ddab49d868cdb6deb7c3e17c5ef9bbdbab86c3e)
  Change-Id: I46399594d1d7f76b78b9860e483716fdae8fc7d6
  Signed-off-by: m00891 <[email protected]>
* Improve CheckFiniteAndUnscaleKernel by putting address access to shared memory and making single thread do more tasks.
  (cherry picked from commit 631ffdda2847cda9562e591dc87b3f529a51a978)
  Change-Id: Ie9ffdd872ab06ff34d4daf3134d6744f5221e41e
  Signed-off-by: m00891 <[email protected]>
* Optimize SwinTransformer
  1.LayerNormBackward: remove if statement, now will always loop VPT times for ldg128 in compiler, bool flag to control if write action will be taken or not; 2.ContiguousCaseOneFunc: tmp saving division result for less division
  (cherry picked from commit 422d676507308d26f6107bed924424166aa350d3)
  Change-Id: I37aab7e2f97ae6b61c0f50ae4134f5eb1743d429
  Signed-off-by: m00891 <[email protected]>
* Optimize LayerNormBackwardComputeGradInputWithSmallFeatureSize
  Set BlockDim.z to make blockSize always be 512, each block can handle several batches. Then all threads will loop 4 times for better performance.
  (cherry picked from commit 7550c90ca29758952fde13eeea74857ece41908b)
  Change-Id: If24de87a0af19ee07e29ac2e7e237800f0181148
  Signed-off-by: m00891 <[email protected]>
* improve KeMatrixTopK: 1.fix private memory; 2.modify max grid size; 3.change it to 64 warp reduce.
  (cherry picked from commit a346af182b139dfc7737e5f6473dc394b21635d7)
  Change-Id: I6c8d8105fd77947c662e6d22a0d15d7bad076bde
  Signed-off-by: m00891 <[email protected]>
* Modify LayerNorm Optimization
  Might have lossdiff with old optimization without atomicAdd.
  (cherry picked from commit 80b0bcaa9a307c94dbeda658236fd75e104ccccc)
  Change-Id: I4a7c4ec2a0e885c2d581dcebc74464830dae7637
  Signed-off-by: m00891 <[email protected]>
* improve roi_align op: 1.adaptive block size; 2.FastDivMod.
  (cherry picked from commit cc421d7861c359740de0d2870abcfde4354d8c71)
  Change-Id: I55c049e951f93782af1c374331f44b521ed75dfe
  Signed-off-by: m00891 <[email protected]>
* add workaround for parameters dislocation when calling BatchedGEMM<float16>.
  Change-Id: I5788c73a9c45f65e60ed5a88d16a473bbb888927
* fix McFlashAttn string
  Change-Id: I8b34f02958ddccb3467f639daaac8044022f3d34
* [C500-27046] fix wb issue
  Change-Id: I77730da567903f43ef7a9992925b90ed4ba179c7
* Support compiling external ops
  Change-Id: I1b7eb58e7959daff8660ce7889ba390cdfae0c1a
* support flash attn varlen api and support arm build
  Change-Id: I94d422c969bdb83ad74262e03efe38ca85ffa673
* Add a copyright notice
  Change-Id: I8ece364d926596a40f42d973190525d9b8224d99
* Modify some third-party dependency addresses to public network addresses

---------

Signed-off-by: m00891 <[email protected]>
Co-authored-by: risemeup1 <[email protected]>
Co-authored-by: Nyakku Shigure <[email protected]>
Co-authored-by: gouzil <[email protected]>
Co-authored-by: Wang Bojun <[email protected]>
Co-authored-by: lizexu123 <[email protected]>
Co-authored-by: danleifeng <[email protected]>
Co-authored-by: Vigi Zhang <[email protected]>
Co-authored-by: tianhaodongbd <[email protected]>
Co-authored-by: zyfncg <[email protected]>
Co-authored-by: JYChen <[email protected]>
Co-authored-by: zhaohaixu <[email protected]>
Co-authored-by: Spelling <[email protected]>
Co-authored-by: zhouzj <[email protected]>
Co-authored-by: wanghuancoder <[email protected]>
Co-authored-by: ndren <[email protected]>
Co-authored-by: Nguyen Cong Vinh <[email protected]>
Co-authored-by: Ruibin Cheung <[email protected]>
Co-authored-by: Tian <[email protected]>
Co-authored-by: Yuanle Liu <[email protected]>
Co-authored-by: zhuyipin <[email protected]>
Co-authored-by: 6clc <[email protected]>
Co-authored-by: Wenyu <[email protected]>
Co-authored-by: Xianduo Li <[email protected]>
Co-authored-by: Wang Xin <[email protected]>
Co-authored-by: Chang Xu <[email protected]>
Co-authored-by: wentao yu <[email protected]>
Co-authored-by: zhink <[email protected]>
Co-authored-by: handiz <[email protected]>
Co-authored-by: zhimin Pan <[email protected]>
Co-authored-by: m00891 <[email protected]>
Co-authored-by: shuliu <[email protected]>
Co-authored-by: Yanxin Zhou <[email protected]>
Co-authored-by: Zhao Wu <[email protected]>
Co-authored-by: m00932 <[email protected]>
Co-authored-by: Fangzhou Feng <[email protected]>
Co-authored-by: junwang <[email protected]>
Co-authored-by: m01097 <[email protected]>
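Several of the security fixes collected above (Fix CVE-2024-0521, the run_cmd hardening, the prune_by_memory_estimation OS-command-injection fix) rely on the technique the message names explicitly: parse the command with shlex and execute it without a shell. A minimal sketch of that pattern follows; the function name and signature are illustrative, not Paddle's actual code.

```python
import shlex
import subprocess

def run_cmd_safely(cmd: str) -> str:
    """Illustrative sketch of the shlex hardening named in the
    CVE-2024-0521 fix: split the command string into an argv list
    and run it with shell=False, so shell metacharacters such as
    ';', '|' and '$(...)' arrive as literal arguments instead of
    being interpreted by /bin/sh."""
    argv = shlex.split(cmd)
    completed = subprocess.run(
        argv, shell=False, capture_output=True, text=True, check=True
    )
    return completed.stdout
```

Run this way, an input such as `ls /tmp; rm -rf ~` reaches `ls` as literal arguments instead of being split into two shell commands.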
Parent: e032331 · Commit: b102bc4

File tree

310 files changed, +8435 / -4921 lines


.gitmodules

Lines changed: 13 additions & 27 deletions
@@ -1,6 +1,7 @@
 [submodule "third_party/protobuf"]
 	path = third_party/protobuf
 	url = https://github.com/protocolbuffers/protobuf.git
+	tag = paddle
 	ignore = dirty
 [submodule "third_party/pocketfft"]
 	path = third_party/pocketfft
@@ -21,10 +22,11 @@
 [submodule "third_party/utf8proc"]
 	path = third_party/utf8proc
 	url = https://github.com/JuliaStrings/utf8proc.git
+	tag = v2.6.1
 	ignore = dirty
 [submodule "third_party/warpctc"]
 	path = third_party/warpctc
-	url = https://github.com/baidu-research/warp-ctc.git
+	url = http://pdegit.metax-internal.com/pde-ai/warp-ctc.git
 	ignore = dirty
 [submodule "third_party/warprnnt"]
 	path = third_party/warprnnt
@@ -33,10 +35,12 @@
 [submodule "third_party/xxhash"]
 	path = third_party/xxhash
 	url = https://github.com/Cyan4973/xxHash.git
+	tag = v0.6.5
 	ignore = dirty
 [submodule "third_party/pybind"]
 	path = third_party/pybind
 	url = https://github.com/pybind/pybind11.git
+	tag = v2.4.3
 	ignore = dirty
 [submodule "third_party/threadpool"]
 	path = third_party/threadpool
@@ -45,39 +49,25 @@
 [submodule "third_party/zlib"]
 	path = third_party/zlib
 	url = https://github.com/madler/zlib.git
+	tag = v1.2.8
 	ignore = dirty
 [submodule "third_party/glog"]
 	path = third_party/glog
 	url = https://github.com/google/glog.git
 	ignore = dirty
-[submodule "third_party/eigen3"]
-	path = third_party/eigen3
-	url = https://gitlab.com/libeigen/eigen.git
-	ignore = dirty
 [submodule "third_party/snappy"]
 	path = third_party/snappy
 	url = https://github.com/google/snappy.git
 	ignore = dirty
-[submodule "third_party/cub"]
-	path = third_party/cub
-	url = https://github.com/NVIDIA/cub.git
-	ignore = dirty
-[submodule "third_party/cutlass"]
-	path = third_party/cutlass
-	url = https://github.com/NVIDIA/cutlass.git
-	ignore = dirty
 [submodule "third_party/xbyak"]
 	path = third_party/xbyak
 	url = https://github.com/herumi/xbyak.git
+	tag = v5.81
 	ignore = dirty
 [submodule "third_party/mkldnn"]
 	path = third_party/mkldnn
 	url = https://github.com/oneapi-src/oneDNN.git
 	ignore = dirty
-[submodule "third_party/flashattn"]
-	path = third_party/flashattn
-	url = https://github.com/PaddlePaddle/flash-attention.git
-	ignore = dirty
 [submodule "third_party/gtest"]
 	path = third_party/gtest
 	url = https://github.com/google/googletest.git
@@ -98,15 +88,11 @@
 	path = third_party/rocksdb
 	url = https://github.com/Thunderbrook/rocksdb
 	ignore = dirty
-[submodule "third_party/absl"]
-	path = third_party/absl
-	url = https://github.com/abseil/abseil-cpp.git
-	ignore = dirty
-[submodule "third_party/jitify"]
-	path = third_party/jitify
-	url = https://github.com/NVIDIA/jitify.git
+[submodule "third_party/cutlass"]
+	path = third_party/cutlass
+	url = http://pdegit.metax-internal.com/pde-ai/cutlass.git
 	ignore = dirty
-[submodule "third_party/cccl"]
-	path = third_party/cccl
-	url = https://github.com/NVIDIA/cccl.git
+[submodule "third_party/eigen3"]
+	path = third_party/eigen3
+	url = ssh://gerrit.metax-internal.com:29418/MACA/library/mcEigen
 	ignore = dirty
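One note on the `tag = ...` keys added above: git itself does not act on a `tag` entry in .gitmodules (its recognized keys are ones like `path`, `url`, `branch`, and `ignore`), so the pins presumably take effect through MetaX's own checkout tooling. Since the file is INI-formatted, the pins are easy to inspect; a small sketch, assuming it runs from the repository root:

```python
import configparser

# .gitmodules is INI-style, so the stock configparser can read it.
# This only reports the non-standard `tag` pins added in this commit;
# git ignores them, so enforcement is left to external tooling.
cfg = configparser.ConfigParser()
cfg.read(".gitmodules")
for section in cfg.sections():
    if cfg.has_option(section, "tag"):
        print(f"{section}: pinned to {cfg.get(section, 'tag')}")
```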

CMakeLists.txt

Lines changed: 20 additions & 3 deletions
@@ -1,3 +1,4 @@
+# 2024 - Modified by MetaX Integrated Circuits (Shanghai) Co., Ltd. All Rights Reserved.
 # Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -24,7 +25,7 @@ endif()
 # https://cmake.org/cmake/help/v3.0/policy/CMP0026.html?highlight=cmp0026
 cmake_policy(SET CMP0026 OLD)
 cmake_policy(SET CMP0079 NEW)
-set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
+set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake" $ENV{CMAKE_MODULE_PATH})
 set(PADDLE_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR})
 set(PADDLE_BINARY_DIR ${CMAKE_CURRENT_BINARY_DIR})

@@ -92,6 +93,7 @@ endif()

 if(WITH_GPU AND NOT APPLE)
   enable_language(CUDA)
+  set(CMAKE_CUDA_COMPILER_VERSION 11.6)
   message(STATUS "CUDA compiler: ${CMAKE_CUDA_COMPILER}, version: "
                  "${CMAKE_CUDA_COMPILER_ID} ${CMAKE_CUDA_COMPILER_VERSION}")
 endif()
@@ -255,7 +257,7 @@ option(WITH_SYSTEM_BLAS "Use system blas library" OFF)
 option(WITH_DISTRIBUTE "Compile with distributed support" OFF)
 option(WITH_BRPC_RDMA "Use brpc rdma as the rpc protocal" OFF)
 option(ON_INFER "Turn on inference optimization and inference-lib generation"
-       ON)
+       OFF)
 option(WITH_CPP_DIST "Install PaddlePaddle C++ distribution" OFF)
 option(WITH_GFLAGS "Compile PaddlePaddle with gflags support" OFF)
 ################################ Internal Configurations #######################################
@@ -283,7 +285,7 @@ option(
   OFF)
 option(WITH_LITE "Compile Paddle Fluid with Lite Engine" OFF)
 option(WITH_CINN "Compile PaddlePaddle with CINN" OFF)
-option(WITH_NCCL "Compile PaddlePaddle with NCCL support" ON)
+option(WITH_NCCL "Compile PaddlePaddle with NCCL support" OFF)
 option(WITH_RCCL "Compile PaddlePaddle with RCCL support" ON)
 option(WITH_XPU_BKCL "Compile PaddlePaddle with BAIDU KUNLUN XPU BKCL" OFF)
 option(WITH_CRYPTO "Compile PaddlePaddle with crypto support" ON)
@@ -474,6 +476,21 @@ if(WITH_GPU)
   # so include(cudnn) needs to be in front of include(third_party/lite)
   include(cudnn) # set cudnn libraries, must before configure
   include(tensorrt)
+
+  include_directories("$ENV{MACA_PATH}/tools/cu-bridge/include")
+  include_directories("$ENV{MACA_PATH}/include")
+  include_directories("$ENV{MACA_PATH}/include/mcblas")
+  include_directories("$ENV{MACA_PATH}/include/mcr")
+  include_directories("$ENV{MACA_PATH}/include/mcdnn")
+  include_directories("$ENV{MACA_PATH}/include/mcsim")
+  include_directories("$ENV{MACA_PATH}/include/mcsparse")
+  include_directories("$ENV{MACA_PATH}/include/mcfft")
+  include_directories("$ENV{MACA_PATH}/include/mcrand")
+  include_directories("$ENV{MACA_PATH}/include/common")
+  include_directories("$ENV{MACA_PATH}/include/mcsolver")
+  include_directories("$ENV{MACA_PATH}/include/mctx")
+  include_directories("$ENV{MACA_PATH}/include/mcpti")
+  include_directories("$ENV{MACA_PATH}/mxgpu_llvm/include")
   # there is no official support of nccl, cupti in windows
   if(NOT WIN32)
     include(cupti)
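Every include path added in the last hunk is resolved from $ENV{MACA_PATH} at configure time, so an unset or wrong MACA_PATH only shows up later as missing-header errors deep in the build. Below is a hypothetical fail-fast check one could run before invoking CMake; it is not part of this commit, and the directory list is abbreviated from the hunk above.

```python
import os
from pathlib import Path

# A few of the MACA include directories the CMakeLists.txt hunk above
# expects to find (abbreviated; the diff lists the full set).
EXPECTED_SUBDIRS = [
    "tools/cu-bridge/include",
    "include/mcblas",
    "include/mcdnn",
    "mxgpu_llvm/include",
]

def check_maca_path() -> Path:
    """Fail fast if MACA_PATH is unset or missing expected directories."""
    root = os.environ.get("MACA_PATH")
    if not root:
        raise SystemExit("MACA_PATH is unset; the MACA include paths "
                         "in CMakeLists.txt cannot resolve.")
    missing = [d for d in EXPECTED_SUBDIRS if not (Path(root) / d).is_dir()]
    if missing:
        raise SystemExit(f"MACA_PATH={root} lacks: {', '.join(missing)}")
    return Path(root)

if __name__ == "__main__":
    print(f"MACA_PATH looks usable: {check_maca_path()}")
```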

NOTICE

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
+The following files may have been modified by MetaX Integrated Circuits (Shanghai) Co., Ltd. in 2024.
+
+.gitmodules
+CMakeLists.txt
+cmake/cuda.cmake
+cmake/cudnn.cmake
+cmake/cupti.cmake
+cmake/external/brpc.cmake
+cmake/external/cryptopp.cmake
+cmake/external/cutlass.cmake
+cmake/external/dgc.cmake
+cmake/external/dlpack.cmake
+cmake/external/eigen.cmake
+cmake/external/flashattn.cmake
+cmake/external/jemalloc.cmake
+cmake/external/lapack.cmake
+cmake/external/libmct.cmake
+cmake/external/mklml.cmake
+cmake/external/protobuf.cmake
+cmake/external/pybind11.cmake
+cmake/external/utf8proc.cmake
+cmake/flags.cmake
+cmake/generic.cmake
+cmake/inference_lib.cmake
+cmake/nccl.cmake
+cmake/third_party.cmake
+env.sh
+paddle/fluid/distributed/fleet_executor/test/interceptor_ping_pong_with_brpc_test.cc
+paddle/fluid/eager/api/manual/eager_manual/forwards/multiply_fwd_func.cc
+paddle/fluid/eager/auto_code_generator/eager_generator.cc
+paddle/fluid/eager/auto_code_generator/generator/eager_gen.py
+paddle/fluid/framework/details/build_strategy.cc
+paddle/fluid/framework/distributed_strategy.proto
+paddle/fluid/inference/api/resource_manager.cc
+paddle/fluid/inference/api/resource_manager.h
+paddle/fluid/inference/tensorrt/plugin/layernorm_shift_partition_op.cu
+paddle/fluid/inference/tensorrt/plugin/matmul_op_int8_plugin.h
+paddle/fluid/inference/tensorrt/plugin/preln_residual_bias_plugin.cu
+paddle/fluid/memory/allocation/CMakeLists.txt
+paddle/fluid/memory/allocation/allocator_facade.cc
+paddle/fluid/operators/CMakeLists.txt
+paddle/fluid/operators/correlation_op.cu
+paddle/fluid/operators/elementwise/elementwise_op_function.h
+paddle/fluid/operators/fused/CMakeLists.txt
+paddle/fluid/operators/fused/attn_gemm_int8.h
+paddle/fluid/operators/fused/cublaslt.h
+paddle/fluid/operators/fused/fused_gate_attention.h
+paddle/fluid/operators/fused/fused_gemm_epilogue_op.cu
+paddle/fluid/operators/fused/fused_layernorm_residual_dropout_bias.h
+paddle/fluid/operators/fused/fused_multi_transformer_int8_op.cu
+paddle/fluid/operators/fused/fused_multi_transformer_op.cu
+paddle/fluid/operators/fused/fused_multi_transformer_op.cu.h
+paddle/fluid/operators/fused/fused_softmax_mask.cu.h
+paddle/fluid/operators/math/inclusive_scan.h
+paddle/fluid/operators/matmul_op.cc
+paddle/fluid/operators/row_conv_op.cu
+paddle/fluid/operators/sparse_attention_op.cu
+paddle/fluid/platform/cuda_graph_with_memory_pool.cc
+paddle/fluid/platform/device/gpu/cuda/cuda_helper.h
+paddle/fluid/platform/device/gpu/cuda_helper_test.cu
+paddle/fluid/platform/device/gpu/gpu_types.h
+paddle/fluid/platform/device_context.h
+paddle/fluid/platform/dynload/CMakeLists.txt
+paddle/fluid/platform/dynload/cublas.h
+paddle/fluid/platform/dynload/cublasLt.cc
+paddle/fluid/platform/dynload/cublasLt.h
+paddle/fluid/platform/dynload/cusparseLt.h
+paddle/fluid/platform/init.cc
+paddle/fluid/platform/init_phi_test.cc
+paddle/fluid/pybind/eager_legacy_op_function_generator.cc
+paddle/fluid/pybind/fleet_py.cc
+paddle/fluid/pybind/pybind.cc
+paddle/phi/api/profiler/profiler.cc
+paddle/phi/backends/dynload/CMakeLists.txt
+paddle/phi/backends/dynload/cublas.h
+paddle/phi/backends/dynload/cublasLt.cc
+paddle/phi/backends/dynload/cublasLt.h
+paddle/phi/backends/dynload/cuda_driver.h
+paddle/phi/backends/dynload/cudnn.h
+paddle/phi/backends/dynload/cufft.h
+paddle/phi/backends/dynload/cupti.h
+paddle/phi/backends/dynload/curand.h
+paddle/phi/backends/dynload/cusolver.h
+paddle/phi/backends/dynload/cusparse.h
+paddle/phi/backends/dynload/cusparseLt.h
+paddle/phi/backends/dynload/dynamic_loader.cc
+paddle/phi/backends/dynload/flashattn.h
+paddle/phi/backends/dynload/nccl.h
+paddle/phi/backends/dynload/nvjpeg.h
+paddle/phi/backends/dynload/nvrtc.h
+paddle/phi/backends/dynload/nvtx.h
+paddle/phi/backends/gpu/cuda/cuda_device_function.h
+paddle/phi/backends/gpu/cuda/cuda_helper.h
+paddle/phi/backends/gpu/forwards.h
+paddle/phi/backends/gpu/gpu_context.cc
+paddle/phi/backends/gpu/gpu_context.h
+paddle/phi/backends/gpu/gpu_decls.h
+paddle/phi/backends/gpu/gpu_resources.cc
+paddle/phi/backends/gpu/gpu_resources.h
+paddle/phi/backends/gpu/rocm/rocm_device_function.h
+paddle/phi/core/custom_kernel.cc
+paddle/phi/core/distributed/check/nccl_dynamic_check.h
+paddle/phi/core/distributed/comm_context_manager.h
+paddle/phi/core/enforce.h
+paddle/phi/core/flags.cc
+paddle/phi/core/visit_type.h
+paddle/phi/kernels/funcs/aligned_vector.h
+paddle/phi/kernels/funcs/blas/blas_impl.cu.h
+paddle/phi/kernels/funcs/blas/blaslt_impl.cu.h
+paddle/phi/kernels/funcs/broadcast_function.h
+paddle/phi/kernels/funcs/concat_and_split_functor.cu
+paddle/phi/kernels/funcs/cublaslt.h
+paddle/phi/kernels/funcs/deformable_conv_functor.cu
+paddle/phi/kernels/funcs/distribution_helper.h
+paddle/phi/kernels/funcs/dropout_impl.cu.h
+paddle/phi/kernels/funcs/elementwise_base.h
+paddle/phi/kernels/funcs/elementwise_grad_base.h
+paddle/phi/kernels/funcs/fused_gemm_epilogue.h
+paddle/phi/kernels/funcs/gemm_int8_helper.h
+paddle/phi/kernels/funcs/inclusive_scan.h
+paddle/phi/kernels/funcs/layer_norm_impl.cu.h
+paddle/phi/kernels/funcs/math_cuda_utils.h
+paddle/phi/kernels/funcs/reduce_function.h
+paddle/phi/kernels/funcs/scatter.cu.h
+paddle/phi/kernels/funcs/top_k_function_cuda.h
+paddle/phi/kernels/funcs/weight_only_gemv.cu
+paddle/phi/kernels/fusion/cutlass/utils/cuda_utils.h
+paddle/phi/kernels/fusion/gpu/attn_gemm.h
+paddle/phi/kernels/fusion/gpu/fused_dropout_add_utils.h
+paddle/phi/kernels/fusion/gpu/fused_dropout_helper.h
+paddle/phi/kernels/fusion/gpu/fused_layernorm_residual_dropout_bias.h
+paddle/phi/kernels/fusion/gpu/fused_linear_param_grad_add_kernel.cu
+paddle/phi/kernels/fusion/gpu/fused_softmax_mask_upper_triangle_utils.h
+paddle/phi/kernels/fusion/gpu/fused_softmax_mask_utils.h
+paddle/phi/kernels/fusion/gpu/mmha_util.cu.h
+paddle/phi/kernels/gpu/accuracy_kernel.cu
+paddle/phi/kernels/gpu/amp_kernel.cu
+paddle/phi/kernels/gpu/batch_norm_grad_kernel.cu
+paddle/phi/kernels/gpu/contiguous_kernel.cu
+paddle/phi/kernels/gpu/decode_jpeg_kernel.cu
+paddle/phi/kernels/gpu/deformable_conv_grad_kernel.cu
+paddle/phi/kernels/gpu/depthwise_conv.h
+paddle/phi/kernels/gpu/dist_kernel.cu
+paddle/phi/kernels/gpu/flash_attn_grad_kernel.cu
+paddle/phi/kernels/gpu/flash_attn_kernel.cu
+paddle/phi/kernels/gpu/flash_attn_utils.h
+paddle/phi/kernels/gpu/gelu_funcs.h
+paddle/phi/kernels/gpu/generate_proposals_kernel.cu
+paddle/phi/kernels/gpu/group_norm_kernel.cu
+paddle/phi/kernels/gpu/interpolate_grad_kernel.cu
+paddle/phi/kernels/gpu/kthvalue_kernel.cu
+paddle/phi/kernels/gpu/llm_int8_linear_kernel.cu
+paddle/phi/kernels/gpu/masked_select_kernel.cu
+paddle/phi/kernels/gpu/nonzero_kernel.cu
+paddle/phi/kernels/gpu/roi_align_grad_kernel.cu
+paddle/phi/kernels/gpu/roi_align_kernel.cu
+paddle/phi/kernels/gpu/strided_copy_kernel.cu
+paddle/phi/kernels/gpu/top_k_kernel.cu
+paddle/phi/kernels/gpu/top_p_sampling_kernel.cu
+paddle/phi/kernels/gpu/unique_consecutive_functor.h
+paddle/phi/kernels/gpu/unique_kernel.cu
+paddle/phi/kernels/gpudnn/conv_cudnn_v7.h
+paddle/phi/kernels/gpudnn/softmax_gpudnn.h
+paddle/phi/kernels/impl/deformable_conv_grad_kernel_impl.h
+paddle/phi/kernels/impl/llm_int8_matmul_kernel_impl.h
+paddle/phi/kernels/impl/matmul_kernel_impl.h
+paddle/phi/kernels/impl/multi_dot_kernel_impl.h
+paddle/phi/kernels/primitive/datamover_primitives.h
+paddle/phi/kernels/primitive/kernel_primitives.h
+paddle/phi/tools/CMakeLists.txt
+paddle/utils/flat_hash_map.h
+patches/eigen/TensorReductionGpu.h
+python/paddle/base/framework.py
+python/paddle/distributed/launch/controllers/watcher.py
+python/paddle/profiler/profiler_statistic.py
+python/paddle/utils/cpp_extension/cpp_extension.py
+python/paddle/utils/cpp_extension/extension_utils.py
+test/CMakeLists.txt
+test/cpp/CMakeLists.txt
+test/cpp/jit/CMakeLists.txt
+test/cpp/new_executor/CMakeLists.txt
+test/legacy_test/test_flash_attention.py
+tools/ci_op_benchmark.sh

README.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ PaddlePaddle is originated from industrial practices with dedication and commitm

 ## Installation

-### Latest PaddlePaddle Release: [v2.5](https://github.com/PaddlePaddle/Paddle/tree/release/2.5)
+### Latest PaddlePaddle Release: [v2.6](https://github.com/PaddlePaddle/Paddle/tree/release/2.6)

 Our vision is to enable deep learning for everyone via PaddlePaddle.
 Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddle/releases) to track the latest features of PaddlePaddle.

README_cn.md

Lines changed: 2 additions & 2 deletions
@@ -18,9 +18,9 @@

 ## 安装

-### PaddlePaddle最新版本: [v2.5](https://github.com/PaddlePaddle/Paddle/tree/release/2.5)
+### PaddlePaddle 最新版本: [v2.6](https://github.com/PaddlePaddle/Paddle/tree/release/2.6)

-跟进PaddlePaddle最新特性请参考我们的[版本说明](https://github.com/PaddlePaddle/Paddle/releases)
+跟进 PaddlePaddle 最新特性请参考我们的[版本说明](https://github.com/PaddlePaddle/Paddle/releases)

 ### 安装最新稳定版本:
 ```

(In English: the changed heading reads "Latest PaddlePaddle release: v2.6" and the changed sentence reads "To follow the latest PaddlePaddle features, see our release notes"; the edit bumps v2.5 to v2.6 and adds spaces around the Latin text.)

README_ja.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ PaddlePaddle は、工業化に対するコミットメントを持つ工業的

 ## インストール

-### PaddlePaddle の最新リリース: [v2.5](https://github.com/PaddlePaddle/Paddle/tree/release/2.5)
+### PaddlePaddle の最新リリース: [v2.6](https://github.com/PaddlePaddle/Paddle/tree/release/2.6)

 私たちのビジョンは、PaddlePaddle を通じて、誰もが深層学習を行えるようにすることです。
 PaddlePaddle の最新機能を追跡するために、私たちの[リリースのお知らせ](https://github.com/PaddlePaddle/Paddle/releases)を参照してください。

(In English: the changed heading reads "Latest release of PaddlePaddle: v2.6"; the edit bumps v2.5 to v2.6.)
