Skip to content

Sniffer: Fix potential infinite loop #4726

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 16, 2025
Merged

Sniffer: Fix potential infinite loop #4726

merged 6 commits into from
May 16, 2025

Conversation

patterniha
Copy link
Contributor

@patterniha patterniha commented May 14, 2025

during reading sniffer code, i found two bugs and fixed them:

1. infinite-loop bug:

cacheErr := cReader.Cache(payload, cacheDeadline)

if cacheErr != buf.ErrReadTimeout (cacheDeadline not exceed) and payload.IsEmpty(), The code gets stuck in infinite loop, and CPU usage reaches 100%, and the app crashes !

(also if cacheErr == nil and payload doesn't change and ErrProtoNeedMoreData, the same thing happens)

This bug is probably the main reason for #3914.

2. fakedns+others:

After reviewing the code, I realized that "destOverride": ["fakedns+others"] is no different from "destOverride": ["fakedns"], That means "destOverride": ["fakedns+others"] behaves exactly like "destOverride": ["fakedns"].

I also tested this to make sure.

@patterniha patterniha force-pushed the fix-bug7 branch 3 times, most recently from 97d089e to 64ab125 Compare May 14, 2025 22:15
@Fangliding
Copy link
Member

Fangliding commented May 15, 2025

这个所谓bug崩溃你真的自己触发过吗 你这么改也很大问题
3914 的问题是quic嗅探的长度推导问题导致的切片错误 panic信息都在那了你是怎么联系上去的 我没有把pr推推上来是因为这个sniffer是是从v2ray抄来的所以我先发了那里 只是他们那边暂时没人看而已

@Fangliding Fangliding closed this May 15, 2025
@patterniha
Copy link
Contributor Author

@Fangliding

Read carefully what I said.

It doesn't matter that cacheErr cannot be nil, if cacheErr != buf.ErrReadTimeout(cacheDeadline not exceed) it stuck in infinite-loop.

cc@RPRX

@Fangliding
Copy link
Member

我懂了 那撑死加这两行足够了吧
image

@patterniha
Copy link
Contributor Author

Please reopen.

@Fangliding
Copy link
Member

改成我说的那样应该就行了 你这么改肯定不对

@Fangliding Fangliding reopened this May 15, 2025
@patterniha
Copy link
Contributor Author

patterniha commented May 15, 2025

我懂了 那撑死加这两行足够了吧 image

@Fangliding

No, it is not enough.

suppose we received some data and payload is not empty and sniffer is in ErrProtoNeedMoreData state, then if we received cacheErr = nil and payload does not changed -> we stuck in loop again.

this code:

if !mb.IsEmpty() {

shows that it is possible that payload not change, and we have cacheErr = nil, so your changes are not enough.

@Fangliding
Copy link
Member

那就是我最前面说的情况了 cReader.Cache() 要么就是成功read了一次 要么失败 不可能返回相同的buf的

@patterniha
Copy link
Contributor Author

patterniha commented May 15, 2025

That's what I said at the beginning. cReader.Cache() either reads successfully once or fails. It is impossible to return the same buf.

@Fangliding

so, i have two question:

  1. in udp, it is possible to send empty-byte, i tested before, so what happens to buf if we send empty-byte?
  2. If buf always changes, why did you use if !mb.IsEmpty()?

@patterniha
Copy link
Contributor Author

patterniha commented May 15, 2025

@Fangliding

anyway, if we add:

if payloadLen == payload.Len() && err ==nil {
return
}

to your code, your code is also ok.

///

also, totalAttempt = 2 may not be enough for ErrNoClue, so it is better to increase that. we have cacheDeadline so why you set totalAttempt too small?

@Fangliding
Copy link
Member

  1. in udp, it is possible to send empty-byte, i tested before, so what happens to buf if we send empty-byte?

它会被无视

  1. If buf always changes, why did you use if !mb.IsEmpty()?

这段代码已经8年了 也许曾经的机制不是这样 所以我觉得需要真的测出有BUG再去动 没事不要乱改

also, totalAttempt = 2 may not be enough for ErrNoClue, so it is better to increase that. we have cacheDeadline so why you set totalAttempt too small?

因为这会导致一些无法识别的流量使sniffer持续尝试直到超时 比如telegram 等于凭空增加200毫秒延迟 这对很多人都是无法接受的

@patterniha
Copy link
Contributor Author

patterniha commented May 15, 2025

@Fangliding

if payloadLen == payload.Len() && err ==nil is impossible -> This code does not run at all.

and if possible -> it prevents infinite-loop.

--> it is better to add.

///

i change the code base on your code...

@patterniha
Copy link
Contributor Author

patterniha commented May 15, 2025

@Fangliding

and what about second one?

can you tell me what exactly should fakedns+others do?

I'm sure and I tested, there's no difference from ‍fakedns in all situations.

@patterniha
Copy link
Contributor Author

patterniha commented May 15, 2025

Because this will cause some unrecognizable traffic to cause the sniffer to keep trying until it times out, such as telegram, which is equivalent to adding 200 milliseconds of delay out of thin air, which is unacceptable to many people.

we increase totalAttempt only when we are in ErrProtoNeedMoreData or ErrNoClue state, telegram uses different protocol, and sniffer immediately return error for telegram-initial-request.

ErrNoClue mean current-data matches only a part of ‍the protocol(tls/http/quic) and sniffer cannot determine whether there will eventually be a match or not.

It's very rare that we have another protocol that matches part of tls/http/quic protocol, so we can safety increase totalAttempt and we will not have a delay.

@Fangliding
Copy link
Member

Fangliding commented May 15, 2025

返回NoClue的地方非常多而且以后也可能加新的嗅探器 它的作用仅仅只是与完全不匹配区分并在下一次嗅探不再重复应用嗅探器 不应该期望它全部为ErrNoClue作出什么行为

而且我还是那句话 它真的会无限循环吗? 你有日志辅助证明吗? 你在loop里打个日志输出看到它真的loop了很多次都行 没有的话我觉得就算了 这是最后回复了 没有的话懒得跟你接你话了

@Fangliding Fangliding closed this May 15, 2025
@patterniha
Copy link
Contributor Author

patterniha commented May 15, 2025

Screenshot 2025-05-15 065636

reproduction:

  1. add print to loop:
counter := 0
    for {
        counter++
	println("loop", counter)
        ...
}
  1. For simplicity, i use trojan-inbound with sniff enabled.

  2. then send tcp-trojan-header to inbound-address, without any payload-data. (i use tcp:8.8.8.8:443 for trojan-address)

that is (hexStream): "37653661343330396464663665383836363637396636316163653466363231623065333435356562616332653833316136306631336364310d0a01010808080801bb0d0a"

decrypt-password for use in Xray-core trojan-inbound config: "12345678"

  1. then close the connection immediately.

then you see Thousands of loops in Xray-core log !!!

///

after reading the code i see that you use ErrNoClue carelessly, and if we have ErrNoClue It does not guarantee that a part of protocol is matched.

so i revert totalAttempt to 2, and remove totalAttempt++ for ErrProtoNeedMoreData .

///

please reopen again, this is why Xray-core was crashing randomly when sniffing was on.

@RPRX

where are you? This is one of the biggest bugs discovered and you haven't commented yet?

@Fangliding
Copy link
Member

Fangliding commented May 15, 2025

有复现就好说很多了 虽然这和panic没关系 会多吃cpu而已

但是我还是觉得改两行就够了 没必要parse具体的read err更别动totalAttempt

如果你还是觉得有问题可以提供新的proof

@Fangliding Fangliding reopened this May 15, 2025
@patterniha
Copy link
Contributor Author

patterniha commented May 15, 2025

It's much easier to explain if you can reproduce it. Although this has nothing to do with panic, it just consumes more CPU.

But I still think that changing two lines is enough. There is no need to parse the specific read err, let alone totalAttempt.

If you still think there is a problem, you can provide a new proof

@Fangliding

Why should we run some redundant loop when cacheErr != nil ?

my plan is to fix ErrNoClue and then we can increase totalAttempt to 32.

but with your code we will run 31 redundant loop, please revert, thx.

@patterniha
Copy link
Contributor Author

@Fangliding

and what about second one?

can you tell me what exactly should fakedns+others do?

I'm sure and I tested, there's no difference from ‍fakedns in all situations.

@Fangliding

I'm still waiting for your answer for ‍fakedns+others.

@RPRX
Copy link
Member

RPRX commented May 16, 2025

现在的两个pr应该可以马上合并然后发新版

好的

@RPRX RPRX merged commit bb0e561 into XTLS:main May 16, 2025
39 checks passed
@patterniha
Copy link
Contributor Author

patterniha commented May 16, 2025

X25519MLKEM768 is a very important update. The sooner the server is updated, the better and it needs to be stable. I am afraid that the other things you have added will make v25.5.16 unstable.

@RPRX

my PRs is important too, after fixing about 20 small and big bugs, I wonder why you still don't trust me.

anyway, even @Fangliding reviewed my PRs and found no problem.

@RPRX
Copy link
Member

RPRX commented May 16, 2025

没有针对谁,#3813 #4553 等 PR 我也没有在新版前合并,不然万一有 bug 了又要改,v25.5.16 的改动较小才适合成为新的稳定版

@dyhkwong
Copy link
Contributor

瞎鸡巴乱改的是不是傻逼

@Fangliding
Copy link
Member

啊 这样会炸吗

@RPRX
Copy link
Member

RPRX commented May 16, 2025

刚发版,别搞这种啊

@RPRX
Copy link
Member

RPRX commented May 16, 2025

Anyway v25.5.16 先这样吧,如果引入新的 bug 了你们才会知道为什么我不倾向于在发重要版本前合并些未经验证的 PR

@RPRX
Copy link
Member

RPRX commented May 16, 2025

麻了,重新发版后 GitHub Action 的 check-assets 炸了 https://github.com/XTLS/Xray-core/actions/runs/15068312131

@RPRX
Copy link
Member

RPRX commented May 16, 2025

但是 https://github.com/XTLS/Xray-core/actions/runs/15068312128 的 check-assets 却不会炸,什么区别 @Meo597 @KobeArthurScofield

@KobeArthurScofield
Copy link
Contributor

疑似 GitHub 卡点炸了

@RPRX
Copy link
Member

RPRX commented May 16, 2025

"Event Validation Error: The event type release is not supported because it's not tied to a branch or tag ref."

@patterniha
Copy link
Contributor Author

It's a pain. After re-releasing the version, the check-assets of GitHub Action crashed https://github.com/XTLS/Xray-core/actions/runs/15068312131

@RPRX

i don't understand why you release asset files too.

if there is no geo-files, you can just display a message to users: "no geo files found, please download from ...".

and for tests you can use fixed(outdated) geo files.

@RPRX
Copy link
Member

RPRX commented May 16, 2025

@KobeArthurScofield 不是卡点,试了几次就这个炸,可能是新加的这个 check-assets 遇到同 tag 删了重发这种会炸?

@KobeArthurScofield
Copy link
Contributor

就很怪,要不重新 release 看看?可能是这个 release 的时候不知道出了什么问题

@KobeArthurScofield
Copy link
Contributor

后面找不到 assets 重新下载的流程似乎也有奇怪的问题, @Meo597 有没有想法

@RPRX
Copy link
Member

RPRX commented May 16, 2025

你们先研究一下吧,实在不行 v25.5.17,这一天天的被背刺的

@RPRX
Copy link
Member

RPRX commented May 16, 2025

不过应该不用 v25.5.17,既然之前把 VCS 关了,按我以前的办法开个 branch 叫 v25.5.16 然后把 release 判断取反并手动填入 commit 七位短 id 并手动触发就行,@KobeArthurScofield 你有权限试一下吗

@KobeArthurScofield
Copy link
Contributor

没权限啊 😂

@patterniha patterniha deleted the fix-bug7 branch May 16, 2025 13:26
@RPRX
Copy link
Member

RPRX commented May 16, 2025

80ea4e0 可行,并且由于之前改成了非 main 的 push 也会触发 action,无需手动触发

@RPRX
Copy link
Member

RPRX commented May 16, 2025

不过你们有空时还是得看下是不是哪里写错了,导致这次这么怪

@KobeArthurScofield
Copy link
Contributor

怪死了,我 fork 下来跑没有复现

有可能是 release 的时候 GitHub 不知道什么问题(race condition 之类的)导致最后进 Actions 的环境有问题,这种情况重跑不行,重跑只针对随机出现的问题,但是环境有问题就不好使了。

不过关了 VCS 的确在这种情况下挺方便的

@Meo597
Copy link
Contributor

Meo597 commented May 16, 2025

我看了一下日志,是github action没能获取到context.ref导致的,报的实际上是类似空指针异常

这种情况即便没有chk-assets后续也绝不可能成功
因为后续geodat找不到
https://github.com/XTLS/Xray-core/actions/runs/15068312125/job/42358265737
证明了这一点

然后chk-assets的流程是:
release-win7、test发现资源文件不存在,只会强制休息90秒,所以这俩的chk-assets永远不会报错
如果release发现资源文件不存在,调度任务scheduled-assets-update.yml,会携带当前分支 or tag信息
现在的问题是不知道为啥context.ref是空的,因此调度任务的时候出错了

根据action的机制,re-run是没用的,因为这些信息是在任务创建的时候装入的

@Meo597
Copy link
Contributor

Meo597 commented May 16, 2025

然后我试了几次,复现不了。
每次都能获取到ref,是release时打的tag

@Meo597
Copy link
Contributor

Meo597 commented May 17, 2025

@RPRX 我刚才又仔细地测了一下

Build and Release #4204: Release v25.5.16 published by RPRX

这种: Release 都是事先已打tag,然后创建release
然而不管是在Release时创建tag,还是有了tag再release、又或是release、tag都删了重建
都没办法复现这个BUG

我倾向于是github内部bug
毕竟action没有context.ref挺匪夷所思的

然后我看了文档,action的cache是分支间不共享,除非继承
ref丢了因此也没法利用到main的缓存,即使没chk-assets也不可能发布成功

想要避免这种问题解决办法就是写防御性代码
在chk-assets没有ref就跳过

然后在后面的buid步骤恢复缓存必定失败,直接curl下载geodat

来回测了七八次,只发现workflow中原先等待90s有时候不够,github调度任务有时延迟较大

@RPRX
Copy link
Member

RPRX commented May 17, 2025

可能是 GitHub 的 bug 吧

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants