explore_eager and new data structure for efficient appends #75
This PR addresses an issue I've been experiencing in interactive use. I want to run the maximization function and then try custom values based on its result. However, the only way to do that is to "initialize", and doing that blows away all previous values of `self.X` and `self.Y`.

To this end I made a function `explore_eager` (name pending), which works like `explore`, but immediately evaluates those values and adds them to your current explored data. It does this in such a way that if `self.maximize` is called again it uses the new custom points when it fits its GP.

Now, to make this change I had to do some non-trivial code refactoring. I saw a few issues in the code base that I felt I should address while I was in there. The main problem was the use of `np.vstack` and `np.append` after evaluating every single point. These numpy methods are O(n), as opposed to the amortized O(1) of Python's `list.append`. With a large sample, these will eventually start causing problems.
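To make the cost difference concrete, here is a minimal sketch (not code from this PR) contrasting growth via `np.vstack` with writing into an over-allocated buffer:

```python
import numpy as np

dim = 3

# O(n) per append: np.vstack reallocates and copies every existing row,
# so building up n points costs O(n^2) overall.
X = np.empty((0, dim))
for _ in range(1000):
    x = np.random.uniform(size=dim)
    X = np.vstack([X, x])

# Amortized O(1) per append: write into an over-allocated buffer and
# only reallocate (doubling capacity) when it fills up.
capacity, length = 16, 0
buf = np.empty((capacity, dim))
for _ in range(1000):
    x = np.random.uniform(size=dim)
    if length == capacity:
        capacity *= 2
        new_buf = np.empty((capacity, dim))
        new_buf[:length] = buf[:length]
        buf = new_buf
    buf[length] = x
    length += 1

X_view = buf[:length]  # view of the valid rows; no copy needed before scipy calls
```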
To fix this I abstracted the point storage into a class called `TargetSpace` that manages the bounds, the X array, and the Y array. Instead of using python lists, I found custom `np.empty` over-allocation with numpy views to be an effective solution. This also removes the need to cast a list of lists to an array whenever a scipy function is called (which would also be O(n)). I think the `TargetSpace` API is pretty clean and it makes the code in bayesian_optimization.py a lot simpler. The main functions are (a rough sketch follows the list):
- a random-sampling method - returns `num` random points within the bounds
- `__len__` - number of unique points added so far
- `__contains__` - returns True if we have seen the point thus far
- `self.X` and `self.Y` - the stored points and their target values
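For concreteness, here is a minimal sketch of the idea; names such as `observe_point`, `random_points`, and `init_capacity` are illustrative assumptions rather than the PR's exact API, and the real class maps points back to the target function's keyword arguments rather than passing a raw vector:

```python
import numpy as np


class TargetSpace(object):
    """Minimal sketch of the point-storage idea, not the PR's exact code."""

    def __init__(self, target_func, bounds, init_capacity=64):
        # bounds: array-like of shape (dim, 2) giving lower/upper limits per dimension
        self.target_func = target_func
        self.bounds = np.asarray(bounds, dtype=float)
        self.dim = self.bounds.shape[0]
        self._capacity = init_capacity
        self._length = 0
        # Over-allocated backing storage; only the first _length rows are valid.
        self._Xarr = np.empty((self._capacity, self.dim))
        self._Yarr = np.empty(self._capacity)
        self._seen = set()  # tuples of observed points, used by __contains__

    @property
    def X(self):
        # View (not a copy) of the points observed so far.
        return self._Xarr[:self._length]

    @property
    def Y(self):
        # View (not a copy) of the corresponding target values.
        return self._Yarr[:self._length]

    def __len__(self):
        # Number of unique points added so far.
        return self._length

    def __contains__(self, x):
        # True if this exact point has already been observed.
        return tuple(np.asarray(x, dtype=float).ravel()) in self._seen

    def random_points(self, num):
        # `num` random points drawn uniformly within the bounds.
        lower, upper = self.bounds[:, 0], self.bounds[:, 1]
        return np.random.uniform(lower, upper, size=(num, self.dim))

    def _grow(self):
        # Double capacity and copy the valid rows, keeping appends amortized O(1).
        self._capacity *= 2
        new_X = np.empty((self._capacity, self.dim))
        new_Y = np.empty(self._capacity)
        new_X[:self._length] = self._Xarr[:self._length]
        new_Y[:self._length] = self._Yarr[:self._length]
        self._Xarr, self._Yarr = new_X, new_Y

    def observe_point(self, x):
        # Evaluate the target at x and record the result; duplicates are skipped.
        x = np.asarray(x, dtype=float).ravel()
        if x in self:
            return None
        if self._length == self._capacity:
            self._grow()
        y = self.target_func(x)
        self._Xarr[self._length] = x
        self._Yarr[self._length] = y
        self._seen.add(tuple(x))
        self._length += 1
        return y
```

The key point is that `X` and `Y` are always contiguous views of the valid rows, so scipy calls can operate on them directly without a list-to-array conversion on every fit.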
Because the class manages only unique points, we can avoid removing duplicates every time a GP is fit. This should also offer some amount of speedup.
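As a usage note for the sketch above, re-observing an already-seen point is a no-op, so the arrays handed to the GP never contain duplicate rows:

```python
space = TargetSpace(lambda x: -np.sum(x ** 2), bounds=[(-1, 1), (-1, 1)])

space.observe_point([0.5, 0.5])
space.observe_point([0.5, 0.5])   # duplicate: skipped, nothing is re-evaluated
space.observe_point([0.1, -0.2])

assert len(space) == 2            # only unique points are stored
assert [0.5, 0.5] in space        # __contains__ lookup
```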
Because this is a fairly large change I also added comprehensive unit tests (targeted towards pytest).
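As an example of the kind of property the tests check (a hedged sketch using the `TargetSpace` outline above, not a test copied from the PR):

```python
import numpy as np


def test_unique_points_and_consistent_views():
    # Hypothetical pytest-style test: duplicates must not grow the arrays,
    # and the X/Y views must stay in sync with what was evaluated.
    space = TargetSpace(lambda x: float(np.sum(x)), bounds=[(0, 1), (0, 1)])

    points = space.random_points(10)
    for x in points:
        space.observe_point(x)
    for x in points:
        space.observe_point(x)  # second pass is all duplicates

    assert len(space) == 10
    assert space.X.shape == (10, 2)
    assert np.allclose(space.Y, space.X.sum(axis=1))
```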
Having this structure should make it possible to further simplify the bayesian_optimization.py codebase and API, but I wanted to keep those changes fairly minimal, at least in this first pass. (For instance, it should be possible to seamlessly accept a dict of lists / list of dicts / ndarray / list of lists / pandas dataframe as input point(s).)
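Purely as an illustration of what that future coercion might look like (not part of this PR; the helper name and behavior are assumptions), a single function could normalize each of those formats into an (n, dim) array given the parameter names:

```python
import numpy as np
import pandas as pd


def points_to_array(points, keys):
    # Illustrative helper: coerce several point formats to an (n, len(keys)) float array.
    if isinstance(points, pd.DataFrame):
        # pandas dataframe with one column per parameter
        return points[list(keys)].values.astype(float)
    if isinstance(points, dict):
        # dict of lists: {'x': [...], 'y': [...]}
        return np.column_stack([np.asarray(points[k], dtype=float) for k in keys])
    if isinstance(points, (list, tuple)) and points and isinstance(points[0], dict):
        # list of dicts: [{'x': ..., 'y': ...}, ...]
        return np.array([[row[k] for k in keys] for row in points], dtype=float)
    # ndarray or list of lists, assumed to already be in the column order of `keys`
    return np.atleast_2d(np.asarray(points, dtype=float))
```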