Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush out content of "Getting Started" #20

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open

Conversation

ablaom
Copy link
Member

@ablaom ablaom commented Nov 19, 2024

Also switched out DecisionTree for KNNRegressor, because there is no ambiguity of pkg in that case.

@ablaom ablaom marked this pull request as draft November 19, 2024 00:36
Copy link

netlify bot commented Nov 19, 2024

Deploy Preview for mlj ready!

Name Link
🔨 Latest commit 838d2af
🔍 Latest deploy log https://app.netlify.com/sites/mlj/deploys/6768b82bb874540008a1b7e3
😎 Deploy Preview https://deploy-preview-20--mlj.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@ablaom
Copy link
Member Author

ablaom commented Nov 19, 2024

@EssamWisam There is some formatting issues for "Getting Started". The current assumption seems to be all lines in all sections are about he same length.

@ablaom ablaom marked this pull request as ready for review November 19, 2024 00:44
@ablaom ablaom requested a review from EssamWisam November 19, 2024 00:44
@EssamWisam
Copy link
Collaborator

EssamWisam commented Nov 21, 2024

Could you check now if it's better? I had to create new section for loading data (maybe we can also add some form of preprocessing there?) because it doesn't look that good when I make the height/width variable and it exceeds the height of the bullets section on the left.

I really like the work and have only two recommendations:

  • Make the loading data more generic relative to common use by downloading and loading a CSV. For instance:
download("https://raw.githubusercontent.com/JuliaAI/Imbalance.jl/dev/docs/src/examples/bmi.csv", "./")
df = CSV.read("./bmi.csv", DataFrame)

Really optional recommendations but it would look nice if the code blocks had relatively a similar number of lines.

@EssamWisam
Copy link
Collaborator

I can make the height vary with the amount of code but I may have to do away with the copy button because the code is rendered by a library and it doesn't expose the pre tag to position the copy button relative to it. That was also a reason I made the height constant.

@ablaom
Copy link
Member Author

ablaom commented Nov 22, 2024

Thanks @EssamWisam for the prompt response and improvements. Let me get back to you on the suggestions re data. I was trying to avoid adding non-MLJ packages, such as CSV and DataFrames (and Downloads, as Base.download is deprecated).

@ablaom
Copy link
Member Author

ablaom commented Nov 22, 2024

We could do something like:

# load Boston dataset from openml.org:
table = OpemML.load(61)

# split data into target (vector, y) and features (table, X):
y, X = unpack(table, ==(:class))

but this is failing because of an unexpected issue I just reported.

@OkonSamuel
Copy link
Member

image
Should this be here?

@EssamWisam
Copy link
Collaborator

Thanks @EssamWisam for the prompt response and improvements. Let me get back to you on the suggestions re data. I was trying to avoid adding non-MLJ packages, such as CSV and DataFrames (and Downloads, as Base.download is deprecated).

I think there may be value in using DataFrames just because it also shows others in this starter's code something more representative of the typical MLJ program they want to write.

@EssamWisam
Copy link
Collaborator

Hopefully fixed by now. Do you have an idea for something specific to put for the subtitle?

@ablaom
Copy link
Member Author

ablaom commented Nov 28, 2024

Okay, I've gone with adding DataFrames and using OpenML.

@EssamWisam Can you review please?

@EssamWisam
Copy link
Collaborator

Okay, I've gone with adding DataFrames and using OpenML.

@EssamWisam Can you review please?

Looks good unless you plan to add more evaluation methods.

@ablaom
Copy link
Member Author

ablaom commented Nov 29, 2024

Okay @EssamWisam I added an extra evaluation to balance the whitespace.

Can you do me a favour and test the whole "Getting Started" workflow locally. As this may be the only MLJ code an MLJ-curious user runs, it's critical to check it thoroughly.

@EssamWisam
Copy link
Collaborator

Okay @EssamWisam I added an extra evaluation to balance the whitespace.

Can you do me a favour and test the whole "Getting Started" workflow locally. As this may be the only MLJ code an MLJ-curious user runs, it's critical to check it thoroughly.

Here is the full code, in my case I get a timeout error for the openml line of code but not sure if it would be a network problem on my side (unlikely):

using Pkg

# Create a new environment (optional):
Pkg.activate("mlj-env", shared=true)

# Install MLJ
Pkg.add("MLJ")

# Install DataFrames (needed for demo):
Pkg.add("DataFrames")

# Test installation (optional):
Pkg.test("MLJ")

# Use these packages:
using MLJ
import DataFrames # names must be qualified

# Grab iris dataset from the openml.org repository:
data = OpenML.load(61)

# Convert to a data frame:
df = DataFrames.DataFrame(data)

# Inspect column names and types:
schema(df)

# Split data into target (vector, y) and
# features (dataframe, X) & randomize obervations
y, X = unpack(df, ==(:class); rng=123)

# Load the model type:
KNN = @iload KNNClassifier

# Create an instance with default hyperparameters:
knn = KNN()

# Wrap the model and data in a machine:
mach = machine(knn, X, y)

# Fit the machine:
fit!(mach)

# Get training predictions:
yhat = predict(mach, X)

# Evaluate with cross validation:
evaluate!(
    mach,
    resampling=CV(nfolds=5),
    measures=[log_loss, accuracy],
)

# Evaluate with Monte Carlo,
# target-stratified cross validation:
evaluate!(
    mach,
    resampling=StratifiedCV(nfolds=5, rng=456),
    repeats=2,
    measures=[log_loss, accuracy],
)

# Define a hyperparameter range:
r = range(knn, :K, lower=3, upper=10)

# Create self-tuning version of model:
tuned_knn = TunedModel(
    knn, range=r, tuning=Grid(goal=8), measure=log_loss)

# Train the wrapped model:
mach = machine(tuned_knn, X, y) |> fit!

# Predict using optimal model:
predict(mach, X)

# Inspect the best model:
knn_best = fitted_params(mach).best_model

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants