Skip to content

Conversation

@iSame7
Copy link
Contributor

@iSame7 iSame7 commented Oct 10, 2020

This PR removes unnecessary shuffle from randomSample method of Collection and Sequence.

// FIXME: necessary?
result.shuffle(using: &rng)

Since randomSample method already Swap selected element with a randomly chosen one in the reservoir there is no need to do that extra shuffle for the reservoir. This was marked as // FIXME by removing it, randomSample still randomly selects the specified number of elements from this sequence, and the RandomSampleTests runs successfully.

Checklist

  • I've added at least one test that validates that my change is working, if appropriate
  • I've followed the code style of the rest of the project
  • I've read the Contribution Guidelines
  • I've updated the documentation if necessary

randomSample method already swap selected element with a randomly chosen one in the reservoir.
randomSample method already swap selected element with a randomly chosen one in the reservoir.
@xwu
Copy link
Contributor

xwu commented Oct 11, 2020

Whether the shuffle is unnecessary or not is a mathematical question. I'm not sure I understand how you arrived at the conclusion that "there is no need to do that" based on your description here, although perhaps it is straightforward. Can you show a proof as to why each element of the original element is equally likely to found at any particular position of the result even without this shuffle?

@timvermeulen
Copy link
Contributor

timvermeulen commented Oct 11, 2020

Edit: Disregard, the guide goes into this.

The shuffle is necessary.

The elements in self that aren't part of the first k are indeed equally likely to appear at any particular index of the result due to

let j = Int.random(in: 0..<result.count, using: &rng)

but the first k elements of self can only appear in the result at the position they're originally in. This is easy to see when k is equal to self.count: without the shuffle, rng is only used inside the loop body which is no longer reached, and indeed array.randomSample(count: array.count) will just return a copy of the original array.

@iSame7
Copy link
Contributor Author

iSame7 commented Oct 12, 2020

Thanks @timvermeulen for clarifying. i guess it's important now to look at what reservoir sampling algorithm says, it works pretty much in two steps:

  • Fill the result array with the first k elements from the original array. This is called the "reservoir".
  • Randomly replace elements in the reservoir with elements from the remaining pool.

According to reservoir sampling algorithm each number is selected with the probability of k/n you can find a great explanation for the mathematical proof here and here by induction. @xwu

Example:

  • Pick 3 numbers from [10, 20, 30, 40]. Make sure each number is selected with a probability of 3/4
  • First, pick [10, 20, 30] as the initial reservoir
  • Then pick 40 with a probability of 3/4
  • For 10, it stays with a probability of
    • P(40 is not selected) + P(40 is selected but it replaces 222 or 333)
    • = 1/4 + 3/4*2/3 = 3/4
  • The same case with 20 and 30
  • Now all the numbers have the probability of 3/4 to be picked

That said, giving each element equal probability of being selected i think shuffling the reservoir is unnecessary step perhaps a side effect of randomSample method that changes the expectations when using it since it uses reservoir sampling algorithm.

@xwu
Copy link
Contributor

xwu commented Oct 12, 2020

Now all the numbers have the probability of 3/4 to be picked

This doesn't address the problem that is at discussion here, which is that not all the numbers have equal probability of being picked first. See the README for more details.

@timvermeulen
Copy link
Contributor

Ah, I clearly misunderstood the purpose of the TODO. Whether we need to shuffle is not a mathematical question, it's a question of whether or not we want the order of the sample to be random as well.

@natecook1000
Copy link
Member

Right — this is about the surprise factor of the result sometimes having elements in their initial order, not the likelihood or weighting of particular elements in the result. While random sampling doesn't necessarily have to guarantee that the randomly selected elements will also be in a random order, I'm concerned that it will be surprising for the user to have that not be the case. The related guide covers the topic in the Post Sample Shuffling section.

Since this is an intentional design decision, I'm going to close this for now. Feel free to open a discussion in the forums if you'd like to talk about this more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants