
Enabling EmbeddingQuantizer and SharedEmbeddingQuantizer #1525


Open · wants to merge 9 commits into base: main

Conversation

dillondesilva

Overview

This PR enables the use of EmbeddingQuantizer and SharedEmbeddingQuantizer as quantization configuration options.

Running lintrunner appears to have changed several lines in this file. However, the edits made specifically to enable these new experimental quantizers can be found on the following lines:

  • Lines 46-49: Imports for EmbeddingQuantizer and SharedEmbeddingQuantizer
  • Lines 202-234: Logic for setting EmbeddingQuantizer and SharedEmbeddingQuantizer options
  • Lines 1033-1034: Entries mapping the quantization config types to the corresponding quantizers (see the sketch below)
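
For context, here is a minimal sketch of what that wiring might look like. The import path and the exact shape of quantizer_class_dict are assumptions (they depend on the pinned torchao version and on the surrounding torchchat code); the "experimental:embedding" / "experimental:shared" keys are the config strings used later in this thread.

# Illustrative sketch only, not the literal diff.
from torchao.experimental.quant_api import (  # import path may vary by torchao version
    EmbeddingQuantizer,
    SharedEmbeddingQuantizer,
)

# quantizer_class_dict maps --quantize JSON keys to quantizer classes.
quantizer_class_dict = {
    # ... existing torchchat entries ...
    "experimental:embedding": EmbeddingQuantizer,
    "experimental:shared": SharedEmbeddingQuantizer,
}

With that mapping in place, the new quantizers can be selected from the CLI, e.g. --quantize '{"experimental:embedding": {"bitwidth": 4, "groupsize": 32, "has_weight_zeros": true}}'.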


pytorch-bot bot commented Apr 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1525

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit 5dbf1ab with merge base a37b08a:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 13, 2025
@Jack-Khuu
Contributor

Looks like the imports aren't happy. I wonder if we need a torchao pin bump?
Wanna give that a try?

weight_dtype = getattr(torch, f"int{bit_width}")

try:
    quantize_(
        model,
        int8_dynamic_activation_intx_weight(
            weight_dtype=weight_dtype,
            granularity=granularity,
Contributor

@metascroy Apr 15, 2025


granularity => weight_granularity
has_weight_zeros=True => weight_mapping_type=MappingType.ASYMMETRIC
has_weight_zeros=False => weight_mapping_type=MappingType.SYMMETRIC
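
Applying that rename, the surrounding call would read roughly as follows. This is only a sketch based on the suggestion above: the weight_dtype / granularity / has_weight_zeros variables come from the quoted hunk, and the MappingType import path may differ depending on the pinned torchao version.

from torchao.quantization.quant_primitives import MappingType  # path may vary by torchao version

# Translate the old has_weight_zeros flag into the new mapping-type argument.
weight_mapping_type = (
    MappingType.ASYMMETRIC if has_weight_zeros else MappingType.SYMMETRIC
)

quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        weight_dtype=weight_dtype,
        weight_granularity=granularity,
        weight_mapping_type=weight_mapping_type,
    ),
)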

@@ -154,45 +170,86 @@ def quantize_model(
print("Encountered error during quantization: {e}")
print("Trying with PlainLayout")
Contributor

Use QDQLayout instead

@metascroy
Contributor

Looks like the imports aren't happy. I wonder if we need a torchao pin bump? Wanna give that a try?

Yeah, you will need to update the torchao pin to something more recent (just pick the latest commit in torchao): https://github.com/pytorch/torchchat/blob/main/install/.pins/torchao-pin.txt

@dillondesilva
Author

@Jack-Khuu I think it's ready to be merged (hopefully haha). Thanks so much for helping out - really appreciate the support from you and Scott :)

Contributor

@Jack-Khuu left a comment

Looks legit, just some small style nits

@metascroy can you give it a glance?

Comment on lines 71 to 85

import inspect


def get_named_parameters(func: Callable) -> List[str]:
    # Get the signature of the function

    signature = inspect.signature(func)

    # Extract the parameters from the signature

    parameters = signature.parameters

    # Filter and return named parameters

Contributor

Mind undoing the whitespaces?

@@ -110,23 +125,73 @@ def quantize_model(

    if isinstance(quantize_options, str):
        quantize_options = json.loads(quantize_options)

    for quantizer, q_kwargs in quantize_options.items():
        if quantizer not in quantizer_class_dict:
            raise RuntimeError(f"unknown quantizer {quantizer} specified")
        else:
            # Use tensor subclass API for int4 weight only.
Contributor

Whoops, comment got split off from: if (device in ["cuda", "xpu", "npu"]) and quantizer == "linear:int4":

# default setup for affine quantization of activations

Contributor

Ditto on the white spaces :)

@metascroy
Contributor

The main concern I have is that shared embedding quantization must be done first. Not sure how to ensure that ordering in torchchat. cc @Jack-Khuu
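
One way to guarantee that ordering (a hypothetical sketch, not something in this PR: the helper below and its name are assumptions, and the "experimental:shared" key is the config string used elsewhere in this thread) would be to sort the parsed options so the shared-embedding entry is processed before everything else:

def iter_quantize_options(quantize_options: dict):
    # Handle the shared-embedding config first so SharedEmbeddingQuantizer
    # still sees the original tied embedding/unembedding weights.
    return sorted(
        quantize_options.items(),
        key=lambda kv: 0 if kv[0] == "experimental:shared" else 1,
    )

for quantizer, q_kwargs in iter_quantize_options(quantize_options):
    # existing per-quantizer handling goes here
    ...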

@Jack-Khuu
Contributor

Jack-Khuu commented Apr 25, 2025

Last little bit: can you add the new quants into the CI https://github.com/pytorch/torchchat/blob/main/.github/workflows/pull.yml

Essentially wherever you see embedding:wx in pull.yml, just add another call in that same test, but using your new quants instead

@metascroy
Contributor

Last little bit: can you add the new quants into the CI https://github.com/pytorch/torchchat/blob/main/.github/workflows/pull.yml

Essentially wherever you see embedding:wx in pull.yml, just add another call in that same test, but using your new quants instead

To really test shared embedding, you need to test a model that has embeddings shared with unembeddings. stories110M (currently used in CI) is not one of them.

Some examples: llama1B, llama3B, phi4-mini, etc.

@Jack-Khuu
Contributor

Hmmm ok let's do this @dillondesilva you can ignore the comments about testing for now
(just hit the style nits + quant order comment)

We'll make a different PR for testing, since it's a tad more involved

Comment on lines 133 to 157
if quantizer == "experimental:embedding":
group_size = q_kwargs["groupsize"]
bit_width = q_kwargs["bitwidth"]
has_weight_zeros = q_kwargs["has_weight_zeros"]
weight_granularity = (
PerAxis() if group_size == -1 else PerGroup(group_size)
)
weight_dtype = getattr(torch, f"int{bit_width}")
weight_mapping_type = (
MappingType.ASYMMETRIC
if has_weight_zeros
else MappingType.SYMMETRIC
)

try:
model = EmbeddingQuantizer(
weight_dtype=weight_dtype,
granularity=weight_granularity,
mapping_type=weight_mapping_type,
use_fallback=False,
).quantize(model)
except Exception as e:
print(
"Encountered error during quantization with experimental EmbeddingQuantization: {e}"
)

Why would you not put this into the EmbeddingQuantizer class, or a subclass derived from EmbeddingQuantizer?
At a minimum, lines 134-146 seem to be copy-pasta that's replicated multiple times, e.g., right below at L159-171, and 195-204, 241-252, ...

Author

@dillondesilva May 12, 2025

Good point! Were you thinking of wrapping the quantizers from torchao (as opposed to directly updating them) for our specific use case, kind of like this?

class TCQuantizer:
    def __init__(self, q_kwargs, quantizer):
        self.q_kwargs = q_kwargs
        self.quantizer = quantizer_class_dict[quantizer]

    def quantize(self, model):
        group_size = self.q_kwargs["groupsize"]
        bit_width = self.q_kwargs["bitwidth"]
        has_weight_zeros = self.q_kwargs["has_weight_zeros"]
        # Other configuration code
        try:
            model = self.quantizer(
                weight_dtype=weight_dtype,
                granularity=weight_granularity,
                mapping_type=weight_mapping_type,
                use_fallback=False,
            ).quantize(model)
        except Exception as e:
            print(
                f"Encountered error during quantization with quantizer: {e}"
            )
        return model

embedding_quantizer = TCQuantizer(q_kwargs, "EmbeddingQuantizer")

@dillondesilva
Author

Hmmm ok let's do this @dillondesilva you can ignore the comments about testing for now (just hit the style nits + quant order comment)

We'll make a different PR for testing, since it's a tad more involved

Yep no worries - sounds like a plan! I'll hop onto these changes soon.

@dillondesilva
Author

Hmmm ok let's do this @dillondesilva you can ignore the comments about testing for now (just hit the style nits + quant order comment)

We'll make a different PR for testing, since it's a tad more involved

Sounds good! Just addressed the style nits + quant order here

@Jack-Khuu
Contributor

Thanks @dillondesilva I'll try to review (and hopefully merge) this today 😃

@Jack-Khuu
Contributor

Jack-Khuu commented May 13, 2025

Digging into this a bit:

python3 torchchat.py generate llama3.2-1b --dtype float16 --quantize '{"experimental:shared": {"bitwidth": 4, "groupsize": 32, "has_weight_zeros": true}}' --prompt "Once upon a time,"

Seems to need a little bit of debugging. Mind tracing through this? Left some initial findings:

  • I think at the end of the conditionals (e.g. if quantizer == "experimental:shared"), it needs a "continue" call, since those branches use a slightly different signature than the normal quantizer calls (see the sketch below)
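
For illustration, a minimal sketch of that control flow, reusing the setup from the experimental:embedding branch quoted earlier (variable names and the exact placement inside quantize_model are assumptions, not the literal diff):

for quantizer, q_kwargs in quantize_options.items():
    if quantizer == "experimental:shared":
        group_size = q_kwargs["groupsize"]
        weight_granularity = PerAxis() if group_size == -1 else PerGroup(group_size)
        weight_dtype = getattr(torch, f"int{q_kwargs['bitwidth']}")
        weight_mapping_type = (
            MappingType.ASYMMETRIC
            if q_kwargs["has_weight_zeros"]
            else MappingType.SYMMETRIC
        )
        model = SharedEmbeddingQuantizer(
            weight_dtype=weight_dtype,
            granularity=weight_granularity,
            mapping_type=weight_mapping_type,
        ).quantize(model)
        continue  # skip the generic handling below; this API takes different arguments

    # generic path (quantizer_class_dict lookup, etc.) continues here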

        ).quantize(model)
    except Exception as e:
        print(
            "Encountered error during quantization with experimental SharedEmbeddingQuantization: {e}"
Contributor

Suggested change
"Encountered error during quantization with experimental SharedEmbeddingQuantization: {e}"
f"Encountered error during quantization with experimental SharedEmbeddingQuantization: {e}"

        ).quantize(model)
    except Exception as e:
        print(
            "Encountered error during quantization with experimental EmbeddingQuantization: {e}"
Contributor

Suggested change
"Encountered error during quantization with experimental EmbeddingQuantization: {e}"
f"Encountered error during quantization with experimental EmbeddingQuantization: {e}"

weight_dtype=weight_dtype,
granularity=weight_granularity,
mapping_type=weight_mapping_type,
use_fallback=False,
Contributor

Suggested change
use_fallback=False,

Contributor

SharedEmbeddingQuantizer doesn't have a fallback arg

Contributor

@Jack-Khuu left a comment

.
