What's in store for Auto-Sklearn? -- From the Developers #1677
Comments
I think that such a refactor would be very beneficial, and make auto-sklearn even more useful. Are there opportunities to contribute to this refactor?
Hi @AmirAlavi, unfortunately not at this time, but once we publish there will definitely be a lot of opportunities to contribute! Thanks very much for offering, we appreciate it :)
Hi, @eddiebergman. My team and I are really looking forward to AutoML integration with the ONNX format. We've been really excited working with AutoML so far, and can't wait to take interoperability to another level.
Hello, brief update: We're in the process of doing some benchmarking, namely to ensure we can still handle most datasets. Progress after this will be to re-enable some more performance features of the original auto-sklearn, namely iterative fitting, meta-learning and AutoSklearn2. @manuelinfosec I am unfamiliar with ONNX, would you be able to help me with a few questions?
Update on ONNX: I spent some time with However this doesn't seem to be supported by If anyone has more information on something I could be missing, please let me know either in the above mentioned issue or here!
@eddiebergman something I'd like to see from this refactor is the ability to specify an optional Or rather than adding that special param to the constructor, giving the aforementioned callback hook access to the runhistory?
@eddiebergman something else we noticed was the extremely long We didn't investigate it thoroughly, but I think I recall discovering that the HistGradientBoosting models would take a very long time to fit. I know ensembling adds another layer to this, but I think we had observed this even for I'm wondering if you had also noticed any performance issues with that, and if the new updates address it? (Perhaps the upgrade to newer sklearn takes care of it.)
Note: If you appreciated this longer form insight into the progress and new design, please give an emoji response, otherwise we can just stick to short form responses :) Please feel free to ask about other topics and I can write up a response on the new underlying API and how things will work.

Hi @AmirAlavi,

Regarding point 1: Callbacks are how the new autosklearn is mostly built, i.e. that's how most of the control flow now works (with some additional extra features ;) ). However, to keep things simple for the end user, the To this end, we have a very simple argument However this

Tasks and Plugins

To give you some idea of how this works and how we tried to increase the surface area of interaction, we now follow a more "server"-like control flow, i.e. event-driven. The bit I'll share for today is namely the notion of a `Task`.
With this context, here are two examples of how you could accomplish your goal of stopping after a number of successful trials, the first focusing on the notion of a `Task`:

```python
askl = AutoSklearn(...)
task = askl.trial_task

# Option 1
@task.on_success(when=lambda: task.on_success.count >= 10)  # optional `when=`
def stop_autosklearn(report: Trial.Report) -> None:
    askl.stop()

# Option 2
task.on_success(
    stop_autosklearn,
    when=lambda: task.on_success.count >= 10,
)

# This `.count` property doesn't exist but I'll add it, thanks for the illuminating question!
# There are other events such as `on_{cancelled/failed/crashed/memout/timeout/...}`
askl.run(...)
```

There are also "plugins" which modify behaviors of a `Task`:

```python
from typing import Callable, ParamSpec, TypeVar
from datetime import datetime

from ... import TaskPlugin, Task, Emitter, Event

P = ParamSpec("P")
R = TypeVar("R")


class MyPlugin(Emitter, TaskPlugin):
    """A TaskPlugin interacts with a submission of a Task before
    it hits the Scheduler. This can be used to modify the function or its
    arguments, as well as getting the highest priority in terms of responding
    to events emitted from the task.
    """

    name = "my-plugin"
    """Name of the plugin for logging purposes"""

    SUCCESS_LIMIT_REACHED: Event[str] = Event("an-event-name")
    """Will emit the current time when the limit is reached"""

    def __init__(self, n: int) -> None:
        super().__init__()
        self.n = n
        self.count = 0

        # These are how the callbacks are created.
        # By using the `Event` defined above, it enables type-safety
        # for anyone using the `.on_reached` attribute to register
        # callbacks
        self.on_reached = self.subscriber(self.SUCCESS_LIMIT_REACHED)

        # The typing is like a `Callable` if familiar (not required)
        self.task: Task[..., Trial.Report] | None = None

    def pre_submit(
        self,
        fn: Callable[P, R],
        *args: P.args,
        **kwargs: P.kwargs,
    ) -> tuple[Callable[P, R], tuple, dict] | None:
        # A TaskPlugin can modify the function and args before
        # it's submitted to the Scheduler
        if self.count >= self.n:
            return None  # Scheduler will not submit anything

        return fn, args, kwargs  # Submit as normal

    def attach_task(self, task: Task[..., Trial.Report]) -> None:
        self.task = task
        task.on_returned(self._check_to_stop)  # This `.on_returned` is created the same way as above

    def _check_to_stop(self, report: Trial.Report):
        assert self.task is not None

        if report.status is Trial.Status.Success:
            # Or just: if report.status == "success"
            self.count += 1

        if self.count >= self.n:
            time_stamp = datetime.now().isoformat()
            self.on_reached.emit(time_stamp)


myplugin = MyPlugin(n=10)  # `n` is required by `__init__`

# Only now are we really in the context of AutoSklearn
askl = AutoSklearn(..., plugins=[myplugin])

@myplugin.on_reached
def stop_askl(timestamp: str) -> None:
    askl.stop()
    print(f"askl_stopped at {timestamp} after {myplugin.n} successes")
```

These plugins are how a lot of the additional, optional functionality is given to autosklearn tasks, such as memory limiting and call limiting. Sorry for the long response to what's a rather simple question; I wanted to share a little bit of how the internals work so that you can give any feedback or raise any other questions. I unfortunately can still not share any source code.

Regarding question 2: I don't think we ever noticed this, so thanks for bringing it to our attention. Could you raise a separate issue so that we have a note to investigate this? Unfortunately it's a little more complicated than just an ensemble. AutoSklearn works by ensembling all fold models from cv-folds and then also creates a weighted ensemble on top of this. For example, with 5-fold CV and an ensemble which contains 3 models, this will be the weighted probabilities of 3 ensembles of 5 models, i.e. all 15 trained sklearn models will be used. This still wouldn't fully explain the fitting times, but just for some extra info.
@eddiebergman Thank you for the longform response! It was very useful and informative. I like the new API from what it sounds like (though I don't think I have my head fully wrapped around it, since I'm not familiar with Python's concurrent package). To confirm/summarize:
Will we be able to specify the Bayesian optimization algorithm as well? For example, what if rather than SMAC (which I believe uses a regression tree as the posterior probability model, and also flips a coin every time to determine whether to listen to the posterior model or totally randomly explore), I'd like to try another algorithm? Will the new API have a "registry" of such algorithms already defined? Or maybe define the API under which we can add our own?
@eddiebergman Regarding point two, sure, I'll make an issue so there's a record of it, but I'm not sure I'll get to a thorough investigation myself. And thank you! I wasn't aware of that detail of using all models from the CVs in the ensembling, and that could certainly partially explain why even when setting ensemble size=1, the refit would still take long (since I was using 10-fold CV)! That's really good to know going forward.
@eddiebergman I have another feature request/consideration for you :) A pain point currently is making sure we're doing a proper cross-validation scheme. Because we're doing algorithm selection and hyperparameter tuning (querying many candidate models), I think we need to do something like nested CV. The current

Concretely, you might do this to start with:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y)
cv = StratifiedKFold(n_splits=5)

model = AutoSklearnClassifier(..., cv=cv, ...)

# "model selection" and "hyperparameter tuning" done here,
# but askl will do cv to do this selection, and so no model will
# see the whole X_train
model.fit(X_train, y_train)

# now that we've fixed our model, fit it on the whole train set
model.refit(X_train, y_train)

y_pred = model.predict(X_test)
log_to_experiment_tracker(accuracy_score(y_test, y_pred))
```

But you'll quickly realize you'll want to do an outer loop of CV rather than just a single train-test split, so that you can get a distribution to estimate your generalization error, rather than just a point estimate. You could call this "outer cv" splitter yourself outside the context of askl, but that would require you calling it in an expensive, slow for loop; you could parallelize this, but you're on your own for that logic. Can we make it so that askl can do the outer loop for you as well? Since it's already taking care of running jobs asynchronously and dispatching them to some compute backend.
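For contrast, here is roughly what that manual outer loop looks like; a sketch that reuses the hypothetical `cv=`/`refit` usage from the snippet above plus plain scikit-learn, running the outer folds sequentially, which is exactly the slow part being discussed:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

outer_cv = StratifiedKFold(n_splits=5)
outer_scores = []

for train_idx, test_idx in outer_cv.split(X, y):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    model = AutoSklearnClassifier(...)   # inner model selection / tuning happens here
    model.fit(X_tr, y_tr)
    model.refit(X_tr, y_tr)              # refit the chosen model on the whole outer-train fold
    outer_scores.append(accuracy_score(y_te, model.predict(X_te)))

# A distribution of generalization estimates rather than a single point estimate
print(np.mean(outer_scores), np.std(outer_scores))
```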
Hello, is there something here I can contribute to?
I have a counterpoint to my own suggestion. In the case when you have fixed resources, you have two options:
With (1), for each run to try N total algorithms, you may have to run them longer. Concretely, if the machine has 10 cores and 100GB, to do 5-fold outer CV:
The total time is the same between these two. @eddiebergman is it correct to say that each automl job would also see the same number of algorithms (roughly)? In other words: if you have fixed resources, one could argue that the outer CV loop not being parallelized is OK. If you have scalable resources, then there's a case for the feature request under consideration.
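A back-of-the-envelope version of that argument, assuming perfectly linear scaling (the numbers below are illustrative, not from any benchmark):

```python
cores, outer_folds = 10, 5
t_full = 1.0  # hours for one automl run when it gets all 10 cores (assumed)

# Sequential outer loop: each of the 5 runs gets all 10 cores
sequential_wall_clock = outer_folds * t_full                      # 5.0 hours

# Parallel outer loop: 5 runs at once, 2 cores each, so each run takes ~5x longer
parallel_wall_clock = t_full * (cores / (cores / outer_folds))    # 5.0 hours

# Core-hours per outer fold are identical (10 * 1.0 == 2 * 5.0), so each run
# can also evaluate roughly the same number of configurations either way.
assert sequential_wall_clock == parallel_wall_clock
```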
Hi @AmirAlavi,
Yup, the estimator is still the opinionated
The actual SMAC-related code in AutoSklearn has now been reduced to about 10 lines :) While theoretically other optimizers are possible, the main limitation is whether an optimizer supports a space with conditional hyperparameters. It's possible to use other optimizers which don't take this into account, but this would lead to the optimizer getting confused if, say for example, it chose the model to evaluate as an SVM and

Extra details we can share

The reason we need a "static search space" is that you can statically define your pipelines. This is what will enable the new autosklearn to optimize your own sklearn pipelines, not just our opinionated ones.

```python
from ... import Pipeline, step, choice
from ConfigSpace import Float
from sklearn.pipeline import Pipeline as SklearnPipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

pipeline = Pipeline.create(
    step("imputer", SimpleImputer, space={"strategy": ["mean", "median"]}),
    choice(
        "estimator",
        step("rf", RandomForestClassifier, space={"n_estimators": (1, 10)}),
        step("svc", SVC, space={"C": Float("C", (0.0, 10.0), log=True)}),
    ),
)

# Pass in something that parses out a static space with `parser=...`
# or it will automatically try to find a suitable one.
# In this case, only the ConfigSpaceParser will know how to deal
# with the `ConfigSpace.Float` parameter in the space and you
# will get back a ConfigSpace
space = pipeline.space()

# Likewise for `sample=...`. This will automatically use the
# `ConfigSpaceSampler` as `isinstance(space, ConfigSpace)`
config = pipeline.sample(space)

askl = AutoSklearnClassifier(pipeline=pipeline)
askl.fit(x, y)

# Note here that it gives back a pure sklearn object, no autosklearn
# objects are inside
best: SklearnPipeline = askl.best_
```
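As a side note on the "conditional hyperparameters" point above, here is a small, self-contained sketch in plain ConfigSpace (assuming a reasonably recent ConfigSpace release, independent of the new pipeline API) of the kind of structure an optimizer has to respect:

```python
from ConfigSpace import (
    Categorical,
    ConfigurationSpace,
    EqualsCondition,
    Float,
    Integer,
)

cs = ConfigurationSpace(seed=0)
estimator = Categorical("estimator", ["rf", "svc"])
n_estimators = Integer("n_estimators", (1, 10))
C = Float("C", (1e-3, 10.0), log=True)
cs.add_hyperparameters([estimator, n_estimators, C])

# `n_estimators` is only active when the random forest is chosen,
# `C` only when the SVC is chosen; an optimizer unaware of this
# could happily propose an SVM together with `n_estimators`.
cs.add_conditions([
    EqualsCondition(n_estimators, estimator, "rf"),
    EqualsCondition(C, estimator, "svc"),
])

print(cs.sample_configuration())
```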
Yup that's a fair point. While having this
Roughly yes, that's correct but I have no concrete numbers to give you there.
In terms of enabling this functionality with scalable resources, we do have something like this now, where you can create a 10 core `Scheduler`:

```python
from ... import Scheduler
from autosklearn import AutoSklearnClassifier

total_cores = 10
cores_per_askl = 2
scheduler = Scheduler.with_processes(total_cores)

xs, ys = ...

askls = [
    AutoSklearnClassifier(..., scheduler=scheduler, njobs=cores_per_askl)
    for _ in range(5)
]

for (x, y, askl) in zip(xs, ys, askls):
    askl.fit(x, y)
```

While not fully tested, we should also give capabilities to run on other kinds of resources:

```python
from ... import Scheduler
from dask import ...

# Pass in an Executor
# https://docs.python.org/3/library/concurrent.futures.html#executor-objects
client = dask.x.y.z(...)
scheduler = Scheduler(executor=client.get_executor())

# Some native support (using dask-jobqueue)
scheduler = Scheduler.with_slurm(...)

# Using sklearn's Loky backend parallelism
scheduler = Scheduler.with_loky(...)
```

This will be considered use at your own risk and will rely on user contributions to ensure stability; we cannot fully test with all possible backends. Using remote resources with a separate file-system is low priority and untested (we can't afford AWS for testing) but theoretically possible.
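Since the Executor interface linked above is the standard-library `concurrent.futures` one, something like this should also work in principle (a sketch only; `Scheduler(executor=...)` is taken from the unreleased snippets above, not a published API):

```python
from concurrent.futures import ProcessPoolExecutor

# Plain standard-library executor with 4 local worker processes
scheduler = Scheduler(executor=ProcessPoolExecutor(max_workers=4))
```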
Hi @Aditi840, we are not currently accepting contributions until this re-work is done, but we appreciate the offer!
@eddiebergman, fine
Hi @eddiebergman, do you have any comment about this issue #1695?
Hi @eddiebergman, any new updates on the anticipated timeline for the
We have a working prototype we can't share publicly yet, but in the meantime check out
@eddiebergman thanks for all this work! What other AutoML tools would you vouch for? (e.g. PyCaret, TPOT, H2O, EvalML)
I highly recommend AutoGluon, but you could also refer to methods properly evaluated on the AutoML Benchmark.
What's going on?
Auto-Sklearn has recently been under-maintained, and we appreciate that this has caused many users to face dependency issues as pinned dependencies slowly go out of date. While we support this project primarily through academic means, we are still proud of the community that has formed around it and are dedicated to pushing it forward.
Will Auto-Sklearn still be maintained?
Yes, auto-sklearn will be maintained and updated moving forward! We initially tried some of these updates, e.g. #1611, #1618, but there were larger issues at play. To alleviate this, we are currently working on a major refactor of the tool, introducing more flexibility and long-wanted features, including pipeline export, flexible pipelines, and a modular design. We expect the first prototype to be available within the next 1-2 months.
Why the refactor?
Auto-Sklearn was initially built in the Python 2 era, during the earlier days of scikit-learn. Machine learning libraries and their ecosystem were still developing, and a lot has changed since then. There were also a lot of lessons learned which, while easy in concept, are truly difficult to integrate into the current design.
Doing research with Auto-Sklearn has also become harder: in becoming a robust and well-performing tool, the codebase has become more difficult to use as a base for novel research.
What to expect?
... Not that much; it's a refactor to get back to where we were, but with the goal of making it more extensible.
We will still maintain the front-facing `AutoSklearnClassifier` and `AutoSklearnRegressor`, which will act primarily as they did before, staying very scikit-learn-like with their simple interface.

This refactor will allow us to solve some long-standing issues that have arisen. We looked through all the issues and tried to categorize what this new refactor will enable. Not all of these issues will be solved upon release, but they will provide a tangible road towards these:
- `encoded_missing_value` to `OrdinalEncoder` #1615
- `OneHotEncoder` #1614
- `quantile` in `HistGradientBoostingRegressor` #1613
- `y` argument in the `transform` method #1494

What can I do?
Please let us know what you think and what you'd like to see from this rebuild!