This repo stores my submission for the second round of the selection process for Kharagpur Data Analytics Group (KDAG) Associates.
The first round was a quiz round. The second round involved two coding tasks and a reading task. The coding tasks were Langton's Ants (Basic Python Proficiency) and Music Genre Analysis (NLP); an overview of each can be found below. The Jupyter notebooks for both tasks are well documented, and the Music Genre Analysis report can be found here. The Papers folder is an Obsidian Vault containing my notes on the reading tasks. Given that I got in, I think I did a decent job :D
You are required to complete both Task 1 and Task 2.
- Submit your solution in .ipynb (Jupyter Notebook) format.
- Your notebook must run without errors using "Run All".
- Only use libraries explicitly mentioned in the task.
- Explain your thought process using Markdown cells.
- Final selection for interviews is based entirely on performance in these tasks.
Langton’s Ant is a two-dimensional Turing machine simulation where ants move across a grid based on square colors and pheromones. Their simple movement rules lead to complex behavior over time.
- White Square
  - Turn 90° clockwise
  - Flip the square color
  - Drop a pheromone (e.g., "A" or "B")
  - Move forward one unit
- Black Square
  - Turn 90° counter-clockwise
  - Flip the square color
  - Drop a pheromone
  - Move forward one unit
- Self-pheromone: 80% chance to move straight, 20% to follow standard turning rule
- Cross-pheromone: 20% chance to move straight, 80% to follow standard turning rule
- Pheromone Replacement: A new pheromone replaces any existing one
- Pheromone Decay: Influence fades over ~5 steps
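The movement and pheromone rules above can be sketched as a small object-oriented simulation core. This is a minimal sketch, not the code from my notebook: the names `Ant`, `World`, `tick`, and `PHEROMONE_LIFETIME` are hypothetical, the grid wraps around at the edges, and decay is modelled as a hard cutoff after 5 steps, which is only one possible reading of "fades over ~5 steps".

```python
import random
from dataclasses import dataclass

# Headings as (dx, dy) in clockwise order: up, right, down, left (y grows downward).
HEADINGS = [(0, -1), (1, 0), (0, 1), (-1, 0)]
PHEROMONE_LIFETIME = 5  # assumption: hard cutoff standing in for gradual fading


@dataclass
class Ant:
    name: str         # pheromone label, e.g. "A" or "B"
    x: int
    y: int
    heading: int = 0  # index into HEADINGS


class World:
    def __init__(self, width, height, seed=None):
        self.width, self.height = width, height
        self.black = set()    # coordinates of black squares; every other square is white
        self.pheromones = {}  # (x, y) -> (owner name, step at which it was dropped)
        self.step_count = 0
        self.rng = random.Random(seed)

    def straight_probability(self, ant, cell):
        """Chance that the ant skips the turning rule on this cell."""
        mark = self.pheromones.get(cell)
        if mark is None:
            return 0.0
        owner, dropped_at = mark
        if self.step_count - dropped_at > PHEROMONE_LIFETIME:
            return 0.0  # pheromone has decayed
        return 0.8 if owner == ant.name else 0.2

    def _move(self, ant):
        cell = (ant.x, ant.y)
        if self.rng.random() >= self.straight_probability(ant, cell):
            # Standard rule: clockwise on white, counter-clockwise on black.
            turn = -1 if cell in self.black else 1
            ant.heading = (ant.heading + turn) % 4
        self.black.symmetric_difference_update({cell})       # flip the square color
        self.pheromones[cell] = (ant.name, self.step_count)  # replaces any older mark
        dx, dy = HEADINGS[ant.heading]
        ant.x = (ant.x + dx) % self.width
        ant.y = (ant.y + dy) % self.height

    def tick(self, ants):
        """Advance every ant by one step, then advance the clock."""
        for ant in ants:
            self._move(ant)
        self.step_count += 1
```

A rendering layer (Pygame, for instance) would only need to read `world.black` and the ant positions after each `tick`, which keeps the rules testable without opening a window.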
- Use Python only
- A simulation interface must be present (Pygame is allowed)
- Bonus points for using Object-Oriented Programming (OOP)
- A .py file with runnable simulation code
- Code must be structured and readable
Analyze a dataset of songs tagged with three descriptive keywords and a genre label. The objective is to group songs based on their keyword similarities and extract insights.
- Only use numpy, pandas, matplotlib, seaborn
- No use of scikit-learn or similar libraries for algorithm implementation
- Use both BoW (Bag of Words) and TF-IDF
- Compare both techniques and justify your choice
- Vectorize the keywords accordingly
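Both vectorizers can be written with numpy alone. The sketch below treats each song as a "document" of its keywords; the function names and the unsmoothed IDF formula `log(N / df)` are my assumptions, not something the task prescribes. Vectorizing each keyword slot separately would reuse the same functions on single-keyword documents.

```python
import numpy as np


def build_vocabulary(documents):
    """Map every distinct keyword to a column index (sorted for reproducibility)."""
    words = sorted({word for doc in documents for word in doc})
    return {word: i for i, word in enumerate(words)}


def bag_of_words(documents, vocabulary):
    """Count matrix of shape (n_documents, vocabulary size)."""
    counts = np.zeros((len(documents), len(vocabulary)))
    for row, doc in enumerate(documents):
        for word in doc:
            counts[row, vocabulary[word]] += 1
    return counts


def tf_idf(counts):
    """TF-IDF from a count matrix; assumes every row has at least one keyword."""
    tf = counts / counts.sum(axis=1, keepdims=True)
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / doc_freq)  # assumption: no smoothing term
    return tf * idf


# Made-up example songs, not rows from the dataset.
songs = [["piano", "calm", "slow"], ["guitar", "calm", "fast"]]
vocab = build_vocabulary(songs)
bow = bag_of_words(songs, vocab)
weighted = tf_idf(bow)
```

With this IDF, a keyword that appears in every song gets weight zero, which is the main practical difference from BoW: TF-IDF discounts keywords that do not help tell songs apart.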
- Implement PCA (Principal Component Analysis) manually using numpy
- Reduce vectors to 2D for each keyword type
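A manual PCA needs only centering, a covariance matrix, and an eigendecomposition. The following is a minimal sketch under the assumption that the input has at least two feature columns; the function name `pca` and its return values are illustrative choices.

```python
import numpy as np


def pca(X, n_components=2):
    """Project X onto its top principal components.

    Returns the projected data and the fraction of variance each
    kept component explains.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)

    # Covariance between features (columns are variables).
    cov = np.cov(centered, rowvar=False)

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]

    components = eigvecs[:, order]
    explained_ratio = eigvals[order] / eigvals.sum()
    return centered @ components, explained_ratio
```

Calling `pca(vectors, n_components=2)` once per keyword type gives one 2D point per song for each type. The sign of each component is arbitrary, so plots may appear mirrored between runs on different machines.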
- Combine the reduced vectors into one embedding per song
- Suggested methods: averaging, concatenation, cross-product, etc.
- Justify your method
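Two of the suggested combination methods, averaging and concatenation, are short enough to sketch directly. The function names are hypothetical, and `parts` is assumed to be a list of arrays with one row per song, one array per keyword type.

```python
import numpy as np


def combine_average(parts):
    """Element-wise mean of the per-keyword embeddings: shape stays (n_songs, 2)."""
    return np.mean(np.stack(parts, axis=0), axis=0)


def combine_concatenate(parts):
    """Side-by-side join of the per-keyword embeddings: shape becomes (n_songs, 2 * n_parts)."""
    return np.concatenate(parts, axis=1)
```

Averaging keeps the embedding 2D and easy to plot, but it discards which keyword slot a coordinate came from. Concatenation keeps that information at the cost of a 6D embedding for three keyword types.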
- Apply K-Means or other clustering method
- Justify your choice of k or the method itself
- Visualize clusters
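K-Means itself is a short loop in numpy. This sketch uses random initial centroids drawn from the data and a fixed iteration cap; both are assumptions, as are the names `k_means` and `inertia`. Plotting `inertia` against several values of k (the elbow method) is one common way to justify the choice of k.

```python
import numpy as np


def assign_labels(X, centroids):
    """Index of the nearest centroid for every row of X."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        labels = assign_labels(X, centroids)
        new_centroids = np.array([
            # Keep the old centroid if a cluster ends up empty.
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return assign_labels(X, centroids), centroids


def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    return float(((X - centroids[labels]) ** 2).sum())
```

Because the result depends on the initial centroids, running it with a few different `seed` values and keeping the lowest-inertia run is a cheap safeguard.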
- What is the genre distribution per cluster?
- Do clusters align with genre labels?
- Calculate Silhouette Score
- Predict genre for:
  - [piano, calm, slow]
  - [guitar, emotional, distorted]
  - [synth, mellow, distorted]
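The Silhouette Score and the genre prediction can both be done without scikit-learn. The sketch below is illustrative: it assumes Euclidean distance, at least two clusters, and a score of 0 for single-point clusters. The prediction helpers (`majority_genres`, `predict_genre`) show only one possible approach, nearest centroid followed by the cluster's most common genre, and assume the new song has already been embedded with the same pipeline as the training songs.

```python
from collections import Counter

import numpy as np


def silhouette_score(X, labels):
    """Mean silhouette coefficient over all points."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))

    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster.
        a = distances[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(distances[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)

    return float(scores.mean())


def majority_genres(labels, genres):
    """Most common genre label inside each cluster."""
    by_cluster = {}
    for label, genre in zip(labels, genres):
        by_cluster.setdefault(int(label), Counter())[genre] += 1
    return {label: counts.most_common(1)[0][0] for label, counts in by_cluster.items()}


def predict_genre(embedding, centroids, cluster_genres):
    """Assign the embedding to its nearest centroid and return that cluster's genre."""
    distances = np.linalg.norm(np.asarray(centroids) - np.asarray(embedding), axis=1)
    return cluster_genres[int(distances.argmin())]
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 or below suggest overlapping clusters or misassigned points.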
- Use a white-paper format
- Explain thought process, methodology, results
- Include plots and visualizations
- Invent a creative vectorization technique
- Deeper analysis: by genre, cluster, or keyword
- Explore extrinsic and intrinsic clustering metrics
Example idea (not to be used directly):
Create a 26D vector with frequency of each letter in keywords
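For reference, the letter-frequency example idea amounts to only a few lines. This sketch is here purely to make the idea concrete, since the task says it should not be used directly; the function name is hypothetical, and non-letter characters are simply ignored.

```python
import string

import numpy as np


def letter_frequency_vector(keywords):
    """26-dimensional count of the letters a-z across all keywords of one song."""
    vector = np.zeros(26)
    for char in "".join(keywords).lower():
        if char in string.ascii_lowercase:
            vector[ord(char) - ord("a")] += 1
    return vector
```

Such a vector captures spelling rather than meaning, so two unrelated keywords with similar letters would land close together. That limitation is a useful baseline to compare any custom vectorization against.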
- A Jupyter Notebook (.ipynb), self-contained and error-free
- A report in PDF format
These papers are essential for your personal interview preparation. Choose at least one to prepare. Additional points will be awarded for preparing more than one.
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
- A comprehensive survey on regularization strategies in machine learning
- Statistical Modeling: The Two Cultures
If you're not familiar with Python or the required libraries, refer to: