Simplify copy kernel #28428

Closed
wants to merge 9 commits

Conversation

@zasdfgbnm (Collaborator) commented Oct 22, 2019

Stack from ghstack:

Using the new type promotion and dynamic casting added to
`TensorIterator`, the copy kernels can be greatly simplified.
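
As a rough illustration (an assumed shape, not the exact code in this diff), the device copy kernel essentially reduces to one dtype dispatch around an identity lambda, with `TensorIterator` casting loads and stores on the fly:

```cpp
// Sketch only: assumes ATen's AT_DISPATCH_ALL_TYPES_AND3 macro and the
// gpu_kernel helper for TensorIterator-based loops, and ignores the
// non-dtype concerns (cross-device copies, non-blocking copies, etc.).
static void copy_kernel_cuda(TensorIterator& iter) {
  AT_DISPATCH_ALL_TYPES_AND3(kHalf, kBool, kBFloat16, iter.dtype(0), "copy_", [&] {
    // The lambda is a plain identity; TensorIterator performs the dtype
    // conversion when loading the source element and storing the result.
    gpu_kernel(iter, [] GPU_LAMBDA(scalar_t x) -> scalar_t { return x; });
  });
}
```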

Benchmark on CUDA:

```python
import torch
import timeit
import pandas
import itertools
from tqdm.notebook import tqdm
import math
print(torch.__version__)
print()

_10M = 10 * 1024 ** 2

d = {}

for from_, to in tqdm(itertools.product(torch.testing.get_all_dtypes(), repeat=2)):
    if from_ not in d:
        d[from_] = {}
    a = torch.empty(_10M, dtype=from_, device='cuda')
    min_ = math.inf
    for i in range(100):
        torch.cuda.synchronize()
        start = timeit.default_timer()
        a.to(to)
        torch.cuda.synchronize()
        end = timeit.default_timer()
        elapsed = end - start
        if elapsed < min_:
            min_ = elapsed
    d[from_][to] = int(min_ * 1000 * 1000)

pandas.DataFrame(d)
```

original:
![image](https://user-images.githubusercontent.com/1032377/67623519-e3e6dd80-f7da-11e9-86ea-9cc9f237123b.png)

new:
![image](https://user-images.githubusercontent.com/1032377/67623527-fc56f800-f7da-11e9-82bd-dc1ff9821b68.png)

Differential Revision: D18170995

zasdfgbnm added a commit that referenced this pull request Oct 22, 2019
Using the new type promotion and dynamic casting added to
`TensorIterator`, the copy kernels can be greatly simplified.

**Script:**
```python
import torch
import timeit
import pandas
import itertools
from tqdm import tqdm
import math
print(torch.__version__)
print()

_10M = 10 * 1024 ** 2

d = {}

for from_, to in tqdm(itertools.product(torch.testing.get_all_dtypes(), repeat=2)):
    if from_ not in d:
        d[from_] = {}
    a = torch.zeros(_10M, dtype=from_)
    min_ = math.inf
    for i in range(100):
        start = timeit.default_timer()
        a.to(to)
        end = timeit.default_timer()
        elapsed = end - start
        if elapsed < min_:
            min_ = elapsed
    d[from_][to] = int(min_ * 1000 * 1000)

pandas.DataFrame(d)
```

**Before:**
![image](https://user-images.githubusercontent.com/1032377/67171274-2e93d000-f36b-11e9-8fa0-91edd7dbc8ec.png)

**After:**
![image](https://user-images.githubusercontent.com/1032377/67171200-d361dd80-f36a-11e9-9b22-66292e395a09.png)

ghstack-source-id: 8754f6a
Pull Request resolved: #28428
zasdfgbnm added a commit that referenced this pull request Oct 22, 2019 (ghstack-source-id: b764aff).

zasdfgbnm added a commit that referenced this pull request Oct 23, 2019 (ghstack-source-id: a8356a7).

zasdfgbnm added a commit that referenced this pull request Oct 24, 2019 (ghstack-source-id: 9ebae7b).

zasdfgbnm added a commit that referenced this pull request Oct 25, 2019 (ghstack-source-id: 54b21a3).

@zasdfgbnm (Collaborator, Author) commented:

This PR is actually a good test for non-copying type promotion: if someone breaks type promotion again by inserting a `.to()` call somewhere, this version of the copy kernel will end up in an infinite loop, so the problem would be detected easily by CI.
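
To make that failure mode concrete, here is a purely hypothetical sketch (names are approximate, not the actual ATen code) of why a `.to()` call on the promotion path would recurse back into the copy:

```cpp
// Hypothetical sketch only: copy_ is now built directly on TensorIterator's
// dynamic casting, and Tensor::to() is itself implemented via copy_.
Tensor& copy_(Tensor& dst, const Tensor& src) {
  auto iter = TensorIterator();
  iter.add_output(dst);
  iter.add_input(src);
  iter.build();  // type promotion must only record dtypes, never allocate
  copy_stub(iter.device_type(), iter, /*non_blocking=*/false);
  return dst;
}

// If the promotion path were ever "fixed" by materializing the cast, e.g.
//   Tensor tmp = src.to(dst.scalar_type());  // Tensor::to() dispatches to copy_
// then copy_ -> to() -> copy_ -> ... would never terminate, and the hang
// would show up immediately in CI.
```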

zasdfgbnm added a commit that referenced this pull request Oct 26, 2019 (ghstack-source-id: 3fb1c46).

zasdfgbnm added a commit that referenced this pull request Oct 26, 2019 (ghstack-source-id: 1269ecc).

@zasdfgbnm (Collaborator, Author) commented on this diff hunk:

```cpp
// This is intentionally done after build() because copy has a "promotion"
// rule that always "promotes" to the target dtype.
iter.promote_common_dtype();
AT_DISPATCH_ALL_TYPES_AND3(kHalf, kBool, kBFloat16, iter.dtype(0), "copy_", [&] {
```
I see a PR trying to enable BFloat16 for CUDA:
https://github.com/pytorch/pytorch/pull/27259/files#diff-6684cb81a1865b7d52d9f2f1789cd0ceR68-R70
so I include BFloat16 in this PR as well, so that I can benchmark it.

@zasdfgbnm (Collaborator, Author) commented:

There is a small regression on GPU, so I am not sure whether this should be merged.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 28, 2019
Summary:
Pull Request resolved: pytorch/pytorch#28428

Using the new type promotion and dynamic casting added to
`TensorIterator`, the copy kernels can be greatly simplified.
(The benchmark script and before/after results are the same as in the PR description above.)

Test Plan: Imported from OSS

Differential Revision: D18170995

Pulled By: ezyang

fbshipit-source-id: 461b53641813dc6cfa872a094ae917e750c60759
@facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in 5c5b2c6.

zasdfgbnm deleted the gh/zasdfgbnm/13/head branch October 29, 2019 04:13
@vishwakftw (Contributor) commented:

Did this break the Windows build?

@zasdfgbnm (Collaborator, Author) commented:

@vishwakftw It seems we need to add `--expt-extended-lambda` somewhere?

@ezyang (Contributor) commented Oct 29, 2019:

I'm unlanding the stack.

zasdfgbnm restored the gh/zasdfgbnm/13/head branch October 29, 2019 17:46
zasdfgbnm reopened this Oct 29, 2019
zasdfgbnm closed this Oct 29, 2019
facebook-github-bot deleted the gh/zasdfgbnm/13/head branch November 2, 2019 14:17