About the run() function #1

Open
gtbaby opened this issue Apr 16, 2018 · 1 comment

Comments

gtbaby commented Apr 16, 2018

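Hello! I have recently been studying your code. In run(), since the training result is already stored, why not run only the evaluation part inside the loop over the different thresholds? Or does the training result change with each threshold? As far as I can tell, the threshold has nothing to do with training.
I changed run() as follows; does this look reasonable to you?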
```python
import pandas as pd

# SpamFilter is the class defined in this repository.

def train():
    # Train the model once and save the held-out test set.
    global sf  # run() uses this module-level filter
    sf = SpamFilter(initial=1)
    all_mail = sf.getEmailList(initial=1)
    split = int(len(all_mail) * 0.8)
    train_mail = all_mail.iloc[:split]  # first 80% of the data as the training set
    check_mail = all_mail.iloc[split:]  # remaining 20% as the test set
    check_mail.to_csv('test_set')
    sf.trainDict(train_mail)
    sf = SpamFilter()

def run(T=0):
    # Evaluate only; no retraining.
    threshold = T
    check_mail = pd.read_csv('test_set', index_col=0)
    check_mail['predict'] = check_mail.wordlist.apply(lambda x: sf.predictEmail(x, threshold))
    foo = check_mail.predict + check_mail.spam  # 1 means prediction and label disagree
    counts = foo.value_counts()
    all_right = 1 - float(counts.get(1, 0)) / counts.sum()
    ham_right = float(counts.get(0, 0)) / check_mail.spam.value_counts()[0]
    print("Threshold:", threshold)
    print("Overall accuracy", all_right * 100, '%')
    print("Ham recall", ham_right * 100, '%')
    return (all_right, ham_right)
```
Finally, in main, add a call to train() just before the for loop.
Thank you for your answer.
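
For readers following the evaluation step: the `predict + spam` sum equals 1 exactly when the prediction and the label disagree, assuming both columns hold 0/1 values. Below is a minimal pure-Python sketch of the same two metrics. The label lists are made up for illustration, the `evaluate` helper is hypothetical, and `collections.Counter` stands in for pandas' `value_counts()`.

```python
from collections import Counter

# Made-up labels for illustration only: 1 = spam, 0 = ham.
spam    = [0, 0, 0, 1, 1, 0, 1, 0]
predict = [0, 0, 1, 1, 0, 0, 1, 0]

def evaluate(predict, spam):
    # predict + spam is 0 (both ham), 1 (disagreement) or 2 (both spam).
    counts = Counter(p + s for p, s in zip(predict, spam))
    total = len(spam)
    all_right = 1 - counts[1] / total       # fraction classified correctly
    ham_right = counts[0] / spam.count(0)   # fraction of ham predicted as ham
    return all_right, ham_right

all_right, ham_right = evaluate(predict, spam)
print("Overall accuracy: {:.1f}%".format(all_right * 100))
print("Ham recall: {:.1f}%".format(ham_right * 100))
```

Both values are fractions, so each one is multiplied by 100 before being printed as a percentage.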

Owner

c1nty commented Jul 15, 2018

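Sorry, I only just saw this.
Your change is perfectly fine. The threshold is only used to find the best cut-off point; the later code uses a line chart to show how to choose the best threshold, and there is no need to retrain the model while determining it.
I put the training and prediction code together so that the workflow would be easier to explain when I handed in the assignment.
Thank you for taking the time to read my code.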
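
The train-once, sweep-thresholds pattern discussed in this thread might look roughly like the sketch below. It does not use the repository's SpamFilter: `train`, `accuracy`, the scores, and the labels are all hypothetical stand-ins, and the rule "flag as spam when the score exceeds the threshold" is an assumption about how the threshold is applied.

```python
# Toy stand-ins for illustration; not the repository's SpamFilter API.
train_calls = 0

def train():
    """Pretend to train; return made-up scores and labels for a fixed test set."""
    global train_calls
    train_calls += 1
    scores = [-2.0, -0.5, 0.3, 1.5, 2.5, -1.0]  # hypothetical model scores
    labels = [0, 0, 0, 1, 1, 0]                 # 1 = spam, 0 = ham
    return scores, labels

def accuracy(scores, labels, threshold):
    # Assumption: an email is flagged as spam when its score exceeds the threshold.
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

scores, labels = train()  # training happens exactly once, outside the loop
thresholds = [-1.0, 0.0, 1.0, 2.0]
results = {t: accuracy(scores, labels, t) for t in thresholds}
best = max(results, key=results.get)  # threshold with the highest accuracy
print("accuracy per threshold:", results)
print("best threshold:", best)
```

Only the evaluation runs once per threshold, which is why the sweep stays cheap even when training is slow.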
