Add the tutorial of data analysis using TO RUN statement. #2977
LGTM with minor comments
doc/tutorial/fraud-analysis.md
Outdated
Data analysis can help us understand what is in the dataset and the
characteristics of the data.

Data binning is a common used data analysis way. It can group continous values
commonly-used
Good idea, updated.
doc/tutorial/fraud-analysis.md
Outdated
Data analysis can help us understand what is in the dataset and the
characteristics of the data.

Data binning is a common used data analysis way. It can group continous values
way => technique or trick
Good idea, updated.
doc/tutorial/fraud-analysis.md
Outdated
Data analysis can help us understand what is in the dataset and the
characteristics of the data.

Data binning is a common used data analysis way. It can group continous values
It can groups
Good idea, updated.
doc/tutorial/fraud-analysis.md
Outdated
into a small number of discretized bins. We will get the distribution of the
data from the binning result.

We can use SQLFlow TO RUN statement to execute the runnable which is released in the
execute => call or invoke
Good idea, updated.
doc/tutorial/fraud-analysis.md
Outdated
into a small number of discretized bins. We will get the distribution of the
data from the binning result.

We can use SQLFlow TO RUN statement to execute the runnable which is released in the
runnable => SQLFlow runnable
Good idea, updated.
After the merge of this PR, how about posting a demo to Zhihu.com using our playground? @brightcoder01 @lhw362950217
Sure, I'll work on it.
TO RUN sqlflow/runnable:v0.0.1
CMD "binning.py",
"--dbname=creditcard",
"--columns=time,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount",
Just curious: there are so many columns, which one is used for binning? Or would you please explain a bit of the logic behind the SQL statement?
Good catch. All the columns will be binned into 10 buckets in this SQL statement. I'll add some explanation in the doc.
The same bin_method and bin_num are applied to all the columns if we assign only one value to each of these two arguments. To use a different bin_method or bin_num for each column, we need to assign a list to these two arguments.
"--columns=v1,v2,v3",
"--bin_method=bucket,log_bucket,bucket",
"--bin_num=10,5,20"
I think we need to add an API doc for the pre-made runnables with a detailed explanation; this is tracked in issue #2929.
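To make the single-value versus list behaviour described above concrete, here is a minimal Python sketch. It is not the code of `binning.py`; the helper name `expand_per_column` is hypothetical, and the rule it implements (one value is shared by every column, a comma-separated list is matched to the columns by position) is only what this comment describes.

```python
def expand_per_column(columns, value):
    """Map each column to its setting (hypothetical helper, not SQLFlow code).

    A single value is shared by all columns; a comma-separated list is
    paired with the columns one-to-one, in order.
    """
    items = value.split(",")
    if len(items) == 1:
        return {col: items[0] for col in columns}
    if len(items) != len(columns):
        raise ValueError(
            "expected 1 or %d values, got %d" % (len(columns), len(items))
        )
    return dict(zip(columns, items))


columns = "v1,v2,v3".split(",")

# One method per column, as in the example arguments above.
methods = expand_per_column(columns, "bucket,log_bucket,bucket")

# One bin count per column; values arrive as strings, so convert them.
nums = {c: int(n) for c, n in expand_per_column(columns, "10,5,20").items()}

# A single value applies to every column.
shared = expand_per_column(columns, "bucket")
```

With these inputs, `v2` is paired with `log_bucket` and 5 bins, while `shared` assigns `bucket` to all three columns.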
each bin and also some common used statistical results.

```SQL
SELECT * FROM creditcard.creditcard_binning_result LIMIT 10;
Since this is a case for the playground, we could add a plot so that users get a better visual experience.
Good idea, I will add a plot in the next PR.
LGTM++
doc/tutorial/fraud-analysis.md
Outdated
The result table contains the binning boundaries, proability distribution for
each bin and also some common used statistical results.

```SQL
Same as above. Add `%%sqlflow`.
Good catch, fixed.
doc/tutorial/fraud-analysis.md
Outdated
statement, `v1` will bucketized to 10 bins and `v2` will be bucketized to 5
bins.

```SQL
Same as above. Add `%%sqlflow`.
Good catch, fixed.
doc/tutorial/fraud-analysis.md
Outdated
SQL statement to do the data binning. All the table columns specified in the
`--column` parameters will be bucketized to 10 bins.

```SQL
Add `%%sqlflow`, so that the converted `.ipynb` can run on Jupyter Notebook?
Good catch, fixed.
We can use SQLFlow TO RUN statement to call the SQLFlow runnable which is
released in the form of Docker image. SQLFlow provides some premade runnables
in sqlflow/runnable including the binning runnable. Please use the following
SQL statement to do the data binning. All the table columns specified in the
Briefly explain how the binning was done here?
Good idea, will add it in the next PR.
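Until that explanation lands, a rough sketch of what binning a single column could look like may help readers of this thread. It assumes that the `bucket` method means equal-width bins between the column minimum and maximum; the thread does not confirm this, and the function below is a hypothetical illustration, not the runnable's implementation. It returns the two things the tutorial says the result table contains: the bin boundaries and the probability of each bin.

```python
def equal_width_binning(values, bin_num):
    """Split values into bin_num equal-width bins (assumed meaning of 'bucket').

    Returns (boundaries, probabilities): bin_num + 1 boundaries and the
    fraction of values that falls into each bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_num
    boundaries = [lo + i * width for i in range(bin_num + 1)]

    counts = [0] * bin_num
    for v in values:
        if width > 0:
            # The maximum value belongs to the last bin, not a new one.
            idx = min(int((v - lo) / width), bin_num - 1)
        else:
            # All values are identical; put them in the first bin.
            idx = 0
        counts[idx] += 1

    total = len(values)
    probabilities = [c / total for c in counts]
    return boundaries, probabilities


# Toy data standing in for one column of the table.
sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10]
boundaries, probabilities = equal_width_binning(sample, 5)
```

For the toy data, each of the five bins spans a width of 2 and receives two of the ten values.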
No description provided.