About the run() function #1

gtbaby · 2018-04-16T23:55:03Z

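Hello! I have been studying your code recently. In the run() function, since the training result is already stored, why not run only the evaluation part inside the loop over the different thresholds? Or does the training result differ each time the threshold changes? From what I can see here, the threshold seems to be unrelated to training.
I changed run() as follows; does this look reasonable to you?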
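```python
import pandas as pd
# SpamFilter is the class defined in this repository.

def train():
    # Train the model once and save the held-out test set.
    global sf  # run() reads this module-level filter
    sf = SpamFilter(initial=1)
    all_mail = sf.getEmailList(initial=1)
    split = int(len(all_mail) * 0.8)  # integer position, so the two sets do not overlap
    train_mail = all_mail.iloc[:split]  # first 80% of the data as the training set
    check_mail = all_mail.iloc[split:]  # remaining 20% as the test set
    check_mail.to_csv('test_set')
    sf.trainDict(train_mail)
    sf = SpamFilter()  # re-create the filter without initial=1, as in the original

def run(T=0):
    # Evaluate the stored model at threshold T.
    threshold = T
    check_mail = pd.read_csv('test_set', index_col=0)
    check_mail['predict'] = check_mail.wordlist.apply(lambda x: sf.predictEmail(x, threshold))
    # Assuming 0/1 labels, a sum of 1 marks a misclassified email and 0 a correctly kept ham.
    foo = check_mail.predict + check_mail.spam
    all_right = 1 - float(foo.value_counts()[1]) / foo.value_counts().sum()
    ham_right = float(foo.value_counts()[0]) / check_mail.spam.value_counts()[0]
    print("Threshold:", threshold)
    print("Overall accuracy", all_right * 100, '%')
    print("Ham recall", ham_right * 100, '%')
    return (all_right, ham_right)
```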
Finally, in main, add a single train() call before the for loop.
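
A minimal sketch of that loop structure, so the idea can be run on its own. The `train()` and `run()` here are hypothetical stand-ins that only count calls (the real ones use SpamFilter and the saved test set), and the threshold list is an assumption for illustration:

```python
# Hypothetical stand-ins for the train()/run() pair; they only count calls.
calls = {"train": 0, "run": 0}

def train():
    # Expensive step: should happen exactly once.
    calls["train"] += 1

def run(T=0):
    # Cheap step: evaluate the already-trained model at threshold T.
    calls["run"] += 1
    return (0.0, 0.0)  # placeholder for (all_right, ham_right)

def main(thresholds):
    train()  # train once, outside the loop
    return [run(T) for T in thresholds]  # only evaluation repeats

results = main([0, 1, 2, 3])
print(calls)
```

With four thresholds, training happens once and evaluation four times.
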
Thank you for your answer.

c1nty · 2018-07-15T00:49:13Z

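Sorry, I only just saw this.
This change is perfectly fine. The threshold is used to find the optimal split point; later in the code, a line chart shows how to choose the optimal threshold, and the model does not need to be retrained while the threshold is being determined.
I put the training and prediction code together so that the workflow would be easier to explain when handing in the assignment.
Thank you for taking the time to read my code.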
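
To illustrate why the threshold can be tuned without retraining, here is a small self-contained sketch. The scores and labels are made up, the rule "spam if score > threshold" is an assumption about how the threshold is applied, and picking the threshold with the highest overall accuracy is only one possible criterion (the repository reads it off a line chart instead):

```python
# Toy data (made up for illustration): a spam score per email and its true label.
scores = [-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0]
is_spam = [0, 0, 0, 0, 1, 0, 1, 1]

def evaluate(threshold):
    """Return (overall accuracy, ham recall) at the given threshold."""
    predicted = [1 if s > threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predicted, is_spam))
    ham_kept = sum(p == 0 and y == 0 for p, y in zip(predicted, is_spam))
    return correct / len(scores), ham_kept / is_spam.count(0)

# The scores are computed once; only the cut-off changes inside the loop.
results = {t: evaluate(t) for t in [-2, -1, 0, 1, 2, 3]}
best = max(results, key=lambda t: results[t][0])  # assumed criterion: highest accuracy
print(best, results[best])
```

The model output (here, `scores`) stays fixed throughout; each threshold only changes how those scores are turned into predictions.
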
