Add the tutorial of data analysis using TO RUN statement. #2977

Merged

brightcoder01
Collaborator

No description provided.

wangkuiyi previously approved these changes Sep 29, 2020
Collaborator

LGTM with minor comments

Data analysis can help us understand what is in the dataset and the
characteristics of the data.

Data binning is a common used data analysis way. It can group continuous values
Collaborator

commonly-used

Collaborator Author

Good idea, updated.

Collaborator

way => technique or trick

Collaborator Author

Good idea, updated.

Collaborator

It can groups

Collaborator Author

Good idea, updated.

into a small number of discretized bins. We will get the distribution of the
data from the binning result.

We can use SQLFlow TO RUN statement to execute the runnable which is released in the
Collaborator

execute => call or invoke

Collaborator Author

Good idea, updated.

Collaborator

runnable => SQLFlow runnable

Collaborator Author

Good idea, updated.

@wangkuiyi
Collaborator

After the merge of this PR, how about posting a demo to Zhihu.com using our playground? @brightcoder01 @lhw362950217

@brightcoder01
Collaborator Author

> After the merge of this PR, how about posting a demo to Zhihu.com using our playground? @brightcoder01 @lhw362950217

Sure, I'll work on it.

TO RUN sqlflow/runnable:v0.0.1
CMD "binning.py",
"--dbname=creditcard",
"--columns=time,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount",
Collaborator

Just curious: there are so many columns, so which one is used for binning? Or would you please explain a little of the logic behind the SQL statement?

Collaborator Author

@brightcoder01 Sep 30, 2020

Good catch. All the columns will be binned into 10 buckets in this SQL statement. I'll add some explanation to the doc.
The same bin_method and bin_num are applied to all the columns if we assign only one value to each of these two arguments. If we want a different bin_method or bin_num for each column, we need to assign a list to each argument:

"--columns=v1,v2,v3",
"--bin_method=bucket,log_bucket,bucket",
"--bin_num=10,5,20"

I think we need to add a pre-made runnable API doc with a detailed explanation, which is tracked in issue #2929.
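
The rule described in this reply (one value applies to every column, a list supplies one value per column) can be sketched as follows. This is only an illustration: the helper name `expand_per_column` and its error handling are assumptions made here, not the code in `binning.py`.

```python
def expand_per_column(columns, raw_value, cast=str):
    """Pair each column with its option value.

    Hypothetical helper for illustration only. A single value is applied
    to every column; a comma-separated list must supply exactly one
    value per column.
    """
    names = [c.strip() for c in columns.split(",")]
    values = [cast(v.strip()) for v in raw_value.split(",")]
    if len(values) == 1:
        # Broadcast the single value to all columns.
        values = values * len(names)
    if len(values) != len(names):
        raise ValueError(
            f"expected 1 or {len(names)} values, got {len(values)}")
    return dict(zip(names, values))


# One method per column, and one bin count shared by all columns.
methods = expand_per_column("v1,v2,v3", "bucket,log_bucket,bucket")
nums = expand_per_column("v1,v2,v3", "10", cast=int)
```

Raising an error on a length mismatch is a design choice of this sketch; the real runnable may handle that case differently.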

each bin and also some commonly used statistical results.

```SQL
SELECT * FROM creditcard.creditcard_binning_result LIMIT 10;
Collaborator

For the playground case, we could add a plot so that users get a better visual experience.

Collaborator Author

@brightcoder01 Sep 30, 2020

Good idea, I will add a plot in the next PR.

lhw362950217 previously approved these changes Sep 30, 2020
Collaborator

LGTM++

The result table contains the binning boundaries, the probability distribution for
each bin, and some commonly used statistical results.

```SQL
Collaborator

Same as above. Add %%sqlflow.

Collaborator Author

Good catch, fixed.

statement, `v1` will be bucketized into 10 bins and `v2` will be bucketized into 5
bins.

```SQL
Collaborator

Same as above. Add %%sqlflow.

Collaborator Author

Good catch, fixed.

SQL statement to do the data binning. All the table columns specified in the
`--columns` parameter will be bucketized into 10 bins.

```SQL
Collaborator

Add %%sqlflow so that the converted .ipynb can run in Jupyter Notebook?

Collaborator Author

Good catch, fixed.

@brightcoder01 brightcoder01 merged commit 53fc075 into sql-machine-learning:develop Oct 7, 2020
@brightcoder01 brightcoder01 deleted the gml/to-run-tutorial branch October 7, 2020 22:22
We can use the SQLFlow TO RUN statement to call a SQLFlow runnable, which is
released in the form of a Docker image. SQLFlow provides some premade runnables
in sqlflow/runnable, including the binning runnable. Please use the following
SQL statement to do the data binning. All the table columns specified in the
Collaborator

Briefly explain how the binning was done here?

Collaborator Author

Good idea, will add it in the next PR.
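
Until that explanation lands, here is a minimal sketch of what a binning pass might compute. It assumes that the `bucket` method means equal-width bins over the column's value range. The thread does not confirm that, and the function name `equal_width_binning` and the particular statistics returned are illustrative choices, not the runnable's actual output schema.

```python
import statistics


def equal_width_binning(values, bin_num):
    """Split [min, max] into bin_num equal-width bins and count values.

    Illustrative only: assumes equal-width bucketing, which may differ
    from what the SQLFlow binning runnable really does.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_num
    # bin_num + 1 boundaries; the last one is pinned to hi.
    boundaries = [lo + i * width for i in range(bin_num)] + [hi]
    counts = [0] * bin_num
    for v in values:
        # The maximum value is placed in the last bin, not a new one.
        idx = min(int((v - lo) / width), bin_num - 1) if width > 0 else 0
        counts[idx] += 1
    total = len(values)
    return {
        "boundaries": boundaries,
        "probabilities": [c / total for c in counts],
        "min": lo,
        "max": hi,
        "mean": statistics.mean(values),
        "stddev": statistics.pstdev(values),
    }


result = equal_width_binning([0.0, 1.0, 2.0, 3.0, 4.0, 10.0], bin_num=5)
```

The per-bin probabilities sum to 1, so they can be read directly as the empirical distribution of the column over the bins.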
