Machine Learning

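Build up a knowledge system over time, and keep extending it by reading papers and running experiments. This chapter focuses on theory and practice; for system design, see Machine Learning System Design.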

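1. Interview requirements

Know the common models: their principles, code, practical application, strengths and weaknesses, and common pitfalls.
- Inductive bias; the assumption that data are independent and identically distributed (IID)
The scope covers ML breadth, ML depth, ML application, and coding.
- Expect repeated "why?" follow-ups, such as why a particular trick works
- The math behind each algorithm: write out the main formulas and derive them on a whiteboard
- For newer areas, paper details may be examined
- How each algorithm scales, and how to express it in MapReduce form
- The complexity, parameter count, and compute cost of each algorithm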

2. Examples of standard interview questions

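How to address over-fitting in neural networks
- From the data side, collect more training data. Failing that, use data augmentation.
- Reduce model complexity, e.g. fewer or narrower layers in a neural network, or shallower trees and pruning in tree models. Apply regularization such as an L2 penalty. Use ensemble methods such as bagging.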
- Cross-validation to detect over-fitting.
- Train with more data.
- Data augmentation.
- Feature selection.
- Early stopping.
- Regularization.
- Ensemble methods.
- Pretrained models.
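
Of the items above, early stopping is the easiest to show in a few lines. The following is a minimal sketch, assuming validation loss is recorded once per epoch; the function name `early_stopping` and the `patience` and `min_delta` parameters are illustrative choices, not taken from any particular framework.

```python
def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training is considered stopped once the loss has failed to improve by
    more than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    # Never ran out of patience: training used every epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical loss curve: improves until epoch 2, then drifts upward.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
best, stopped = early_stopping(losses, patience=3)
```

In practice the weights saved at `best` would be restored, so the epochs between `best` and `stopped` are discarded.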
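How to address under-fitting
- Add new features, increase model complexity, and reduce the regularization coefficient.
- The first step in training a model is to make sure it is able to overfit.
How to handle class imbalance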
- https://imbalanced-learn.org/en/stable/user_guide.html
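- For classification with long-tailed data, keep only the top 80% of labels and fold the remaining labels into a single "others" class
- If the imbalance is extreme, e.g. 99.99% vs 0.01% as in spam detection, other approaches such as outlier detection may be the only option
- A natural follow-up topic is sample difficulty (easy vs hard examples)
- Evaluation metric: AP (average_precision_score in scikit-learn)
- Downsampling the majority class: faster convergence and less disk space; calibration is restored by upweighting the downsampled class
- Upweighting: scale the example weights of the downsampled class so that its total contribution to the loss matches the original data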
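
The downsampling and upweighting items above can be sketched together. This is an illustration under assumptions: the function name `downsample_and_upweight`, the choice of label `0` as the majority class, and the integer `factor` are all hypothetical, and only the standard library is used.

```python
import random


def downsample_and_upweight(labels, factor, majority=0, seed=0):
    """Keep 1/factor of the majority class and give each kept row weight `factor`.

    Returns a sorted list of (index, weight) pairs into `labels`.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    # Downsample: keep a random subset of the majority class.
    kept = rng.sample(majority_idx, len(majority_idx) // factor)
    # Upweight: each kept majority row stands in for `factor` original rows.
    weighted = [(i, float(factor)) for i in kept]
    weighted += [(i, 1.0) for i in minority_idx]
    weighted.sort()
    return weighted


labels = [0] * 90 + [1] * 10          # toy data: 10% positives
sample = downsample_and_upweight(labels, factor=10)
total_weight = sum(w for _, w in sample)
positive_weight = sum(w for i, w in sample if labels[i] == 1)
```

The model trains on far fewer rows, while the weighted positive rate (`positive_weight / total_weight`) stays equal to the original one, which is what keeps predicted probabilities calibrated.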
How to handle missing data
- How to Handle Missing Data
How to handle high-cardinality categorical features
Optimizers, and how to choose one
- MSE or log-likelihood objective with gradient descent (GD)
- SGD: when the training data is too large for full-batch updates
- Adam: sparse inputs
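
To make the difference between the optimizers above concrete, here is a minimal scalar sketch of the SGD and Adam update rules. The names `sgd_step` and `Adam` are illustrative, and the hyperparameter defaults are the commonly quoted ones rather than anything specified in this document.

```python
import math


def sgd_step(param, grad, lr=0.1):
    """Plain SGD: move against the gradient by a fixed learning rate."""
    return param - lr * grad


class Adam:
    """Scalar Adam with bias-corrected first and second moment estimates."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running mean of gradients
        self.v = 0.0  # running mean of squared gradients
        self.t = 0    # step counter, used for bias correction

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


# One step on f(x) = x^2 from x = 1.0, where the gradient is 2x = 2.0.
x_sgd = sgd_step(1.0, 2.0, lr=0.1)
x_adam = Adam(lr=0.1).step(1.0, 2.0)
```

Because Adam divides by the root of the squared-gradient average, its step size is roughly `lr` regardless of gradient scale. That per-parameter scaling is a commonly cited reason it suits sparse inputs, where some parameters rarely receive gradients.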
Data collection
- production data, label
- Internet dataset
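How to handle distribution mismatch
- The distribution in question is not only that of the features; the labels have one too. For label shift, the only real remedy is to collect more data, which comes back to the data-balancing problem.
- When the data distribution changes, set up automated retraining and automated deployment. If performance drops too much, manual intervention and retraining are needed.
Recommendation, scale, A/B testing, troubleshooting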
How to reduce model latency
- Use a smaller model
- Knowledge distillation
- Quantize the model to 8-bit or 4-bit
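
As an illustration of the quantization item above, the sketch below applies symmetric per-tensor int8 quantization to a list of weights. It is a simplified assumption of how quantization can work, not the scheme of any specific library; `quantize_int8` and `dequantize` are hypothetical names.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clip to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 32, and the rounding error per weight is at most half of `scale`. Small weights such as `0.003` collapse to zero here, which is one reason real schemes often use per-channel scales.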
Generative vs Discriminative
- A generative model learns how the data in each category is distributed, while a discriminative model learns only the distinction between categories.
- Discriminative models often outperform generative models on classification tasks. A discriminative model learns the predictive distribution p(y|x) directly, while a generative model learns the joint distribution p(x, y) and then obtains the predictive distribution via Bayes' rule.
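
The Bayes' rule step mentioned above can be written out explicitly. Assuming a discrete label y, a generative model that has learned p(x | y) and p(y) recovers the predictive distribution as:

```latex
p(y \mid x) \;=\; \frac{p(x, y)}{p(x)}
          \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
```

Naive Bayes is a standard example of the generative route, and logistic regression is a standard example of modeling p(y | x) directly.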
The bias-variance tradeoff is a central problem in supervised learning
- Ideally, one wants a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously.
- High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data.
- In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit but may underfit their training data, failing to capture important regularities.
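
For squared-error loss the tradeoff above has an exact form. The notation here is an assumption introduced for illustration: y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and \hat{f}_D is the model fitted on a random training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making the model more flexible typically shrinks the bias term and grows the variance term; the noise term is unaffected by the choice of model.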
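Parallelizing models
- Linear / logistic regression
- XGBoost
- CNN
- RNN
- Transformer
- In deep learning frameworks, a single tensor multiplication is parallelized automatically under the hood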
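
The list above only names the models, so the following is one assumed illustration rather than the document's own answer: for linear regression, the gradient is a sum over examples, so it can be computed per data shard (map) and then combined (reduce). The function names are hypothetical.

```python
def shard_gradient(w, b, shard):
    """Map step: summed squared-error gradient over one shard of (x, y) pairs."""
    gw, gb = 0.0, 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    return gw, gb, len(shard)


def reduce_gradients(partials):
    """Reduce step: add the partial sums and divide by the total count."""
    gw = sum(p[0] for p in partials)
    gb = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    return gw / n, gb / n


data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
shards = [data[:2], data[2:]]  # in a real job, each shard lives on a different worker

sharded = reduce_gradients([shard_gradient(0.0, 0.0, s) for s in shards])
full = reduce_gradients([shard_gradient(0.0, 0.0, data)])
```

The sharded result matches the single-machine gradient, which is the property that makes data-parallel training of such models straightforward. Sequential dependencies, as in the time steps of an RNN, are what make other entries in the list harder to parallelize.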

3. Hand-written ML code exercises

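Implement a two-layer fully connected network by hand
Implement a CNN by hand
Implement KNN by hand
Implement K-means by hand
Implement backpropagation for softmax by hand
Implement AUC by hand
Implement SGD by hand
Implement dropout, forward and backward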
random sample with weights
Implement focal loss
Implement multi-head attention by hand
Vision: implement IoU / NMS by hand
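
For the IoU / NMS exercise, a minimal pure-Python sketch might look like the following. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` and a greedy NMS with a single IoU threshold; the default threshold of 0.5 is an illustrative choice.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with IoU 81/119 and is suppressed, while the third box is disjoint and survives. This version is O(n²) in the number of boxes.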
NLP:
- Implement n-gram by hand
- Implement a tokenizer by hand
  - BPE tokenizer
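
As a starting point for the n-gram exercise, the sketch below extracts and counts n-grams from a token list. Whitespace splitting stands in for a real tokenizer here, which is an assumption made only to keep the example short.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: a naive whitespace tokenizer, not BPE.
tokens = "the cat sat on the cat".split()
bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)
```

The counts are the raw material for an n-gram language model, where a conditional probability such as P(cat | the) is estimated as count(the, cat) / count(the).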
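Extensions
- Given the structure of an LSTM network, compute how many parameters it has
- How is the output size of a convolution layer computed? Write out the formula
- Design a sparse matrix (including addition, subtraction, multiplication, and other operations)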
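
The first two extension questions reduce to short formulas. Along one dimension, a convolution with input size n, kernel k, stride s, padding p, and dilation d gives floor((n + 2p - d(k - 1) - 1) / s) + 1 outputs. A single LSTM layer has four gates, each with a weight matrix over the concatenated input and hidden state plus a bias. The sketch below encodes both; the function names are hypothetical, and the bias convention is exposed as a parameter because implementations differ.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one dimension."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def lstm_param_count(input_size, hidden_size, bias_vectors_per_gate=1):
    """Parameters of one LSTM layer (4 gates: input, forget, cell, output).

    Use bias_vectors_per_gate=2 for implementations that keep separate
    input-to-hidden and hidden-to-hidden biases.
    """
    weights = hidden_size * (input_size + hidden_size)
    biases = bias_vectors_per_gate * hidden_size
    return 4 * (weights + biases)


same_padding = conv_output_size(32, kernel=3, stride=1, padding=1)
strided = conv_output_size(224, kernel=7, stride=2, padding=3)
lstm_single_bias = lstm_param_count(10, 20)
lstm_double_bias = lstm_param_count(10, 20, bias_vectors_per_gate=2)
```

With these inputs, a 3-wide kernel with padding 1 preserves the size 32, the strided 7-wide kernel halves 224 to 112, and the LSTM has 2480 parameters with one bias per gate or 2560 with two.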

References

https://github.com/eriklindernoren/ML-From-Scratch
https://github.com/resumejob/interview-questions
https://github.com/2019ChenGong/Machine-Learning-Notes
https://github.com/ctgk/PRML
https://github.com/nxpeng9235/MachineLearningFAQ/blob/main/bagu.md
https://docs.qq.com/doc/DR0ZBbmNKc0l3RGR2
Answers to standard machine learning interview questions
Summary of ML and DL study and interview discussions
