skip_grams #25

I found what looks like a problem in this piece of logic:

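from collections import Counter
import numpy as np

# Keep only words that occur more than 50 times (`words` is the token list from earlier cells)
words_count = Counter(words)
words = [w for w in words if words_count[w] > 50]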

vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}
print("total words: {}".format(len(words)))
print("unique words: {}".format(len(set(words))))
total words: 8623686
unique words: 6791

int_words = [vocab_to_int[w] for w in words]
# Actually, vocab_to_int just maps each word to the position where it first appears

t = 1e-5  # t value
threshold = 0.9  # drop-probability threshold

# And then these indices are actually used to compute word frequencies?? Can anyone tell me what is going on here?
int_word_counts = Counter(int_words)
total_count = len(int_words)
word_freqs = {w: c/total_count for w, c in int_word_counts.items()}
 
prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in int_word_counts}
# Subsample the words
train_words = [w for w in int_words if prob_drop[w] < threshold]
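
One way to check this concern is a small self-contained sketch on a toy corpus. The corpus and the names `counts_via_ids`, `counts_by_word` and `freq_cutoff` are made up for illustration and are not from the notebook. The sketch assumes `vocab_to_int` is built exactly as above, by enumerating a `set`, which would give each word one arbitrary but unique ID rather than a position in the text. If that assumption holds, counting IDs should give the same frequencies as counting the words themselves:

```python
import math
from collections import Counter

# Toy corpus, purely illustrative (not the notebook's data)
words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Same construction as in the notebook: IDs come from enumerating a set,
# so they are arbitrary but unique per word (not positions in the text)
vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}

int_words = [vocab_to_int[w] for w in words]

# Count by ID, then map the IDs back to words so both counts are comparable
counts_by_id = Counter(int_words)
counts_via_ids = {int_to_vocab[i]: n for i, n in counts_by_id.items()}
counts_by_word = dict(Counter(words))

# Expected to match, because the word -> ID mapping is one-to-one
print(counts_via_ids == counts_by_word)

# The drop probability 1 - sqrt(t / f) reaches `threshold`
# when the frequency f equals t / (1 - threshold)**2
t = 1e-5
threshold = 0.9
freq_cutoff = t / (1 - threshold) ** 2
print(freq_cutoff)  # roughly 1e-3, up to floating-point rounding
```

If the comparison also holds on the real data, `word_freqs` contains ordinary word frequencies, just keyed by ID instead of by the word string. Separately, with `t = 1e-5` and `threshold = 0.9` the final filter appears to remove every word whose frequency is at or above roughly 1e-3, and it does so deterministically rather than by random sampling.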
