If you've been preparing for a Google or Meta ML engineering interview, the Atlassian loop will feel both more practical and more idiosyncratic. The classical algorithm round is similar - LeetCode mediums, talk through your reasoning, write clean code. But Atlassian will also ask you to implement an ML algorithm from scratch in NumPy, debug an underperforming model with you as the only engineer in the room, and pass a dedicated Values round that has cost technically strong candidates the offer.
This guide walks through five representative problems from the Atlassian ML engineer loop, how the ML system design round works, and the level differentiation between P4, P5, and P6 for ML roles.
How Atlassian's ML loop differs from the rest
Three notable structural choices:
1. Implement-from-scratch is mandatory. At Google or Meta you can typically pass an ML interview using pseudo-code and discussing your approach. Atlassian requires at least one round where you write working NumPy code from a blank file. K-means, linear regression with gradient descent, or a recommendation algorithm using collaborative filtering are the three most common asks. No scikit-learn, no PyTorch.
2. The Values round is real. Atlassian famously articulates five company values, and one round of your loop is dedicated to grading you against them. The most-asked value pair is "Open Company, No Bullshit" and "Play as a Team" - they want to see directness combined with collaboration. Candidates who pass on technical signal but read as guarded or self-aggrandizing in the Values round get declined.
3. Remote by default. Atlassian was one of the first major tech companies to commit to permanent fully-remote work (their Team Anywhere policy). Interviews are conducted on Zoom or Google Meet. There's no in-office pressure, but it also means the interviewer is judging your remote-work signal: how you present on video, how you handle a screen-share, how you write code in a shared editor.
The interview process for ML engineering roles
| Stage | Format | Focus |
|---|---|---|
| Recruiter screen | 30 min | Background, target team, comp expectations |
| Hiring manager screen | 45 min | Recent ML projects, team fit. Often includes one easy ML conceptual question |
| Technical phone screen | 60 min | 1 LeetCode-style problem + 1 ML conceptual / from-scratch coding |
| Onsite Round 1: Coding | 60 min | 2 problems: 1 classical algorithm + 1 ML implementation from scratch |
| Onsite Round 2: ML system design | 60 min | Design an ML system end-to-end (data, training, serving, monitoring) |
| Onsite Round 3: Model debugging / case study | 60 min | Given a poorly-performing model, diagnose and propose fixes |
| Values round | 45 min | Behavioral against the 5 Atlassian values |
| Hiring manager wrap | 30 min | Q&A, sell mode |
Problem 1: Implement K-means clustering from scratch
Question: Given an array of 2D points and an integer k, implement K-means clustering. Return the final cluster assignments and centroid positions. Use NumPy only - no scikit-learn.
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4):
    n_samples = X.shape[0]
    rng = np.random.default_rng(42)
    indices = rng.choice(n_samples, size=k, replace=False)
    centroids = X[indices].copy()
    for iteration in range(max_iter):
        # Step 1: assign each point to nearest centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        # Step 2: recompute centroids (keep the old one if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence check
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
What Atlassian grades:
- Did you seed the RNG? (Reproducibility matters in ML.)
- Did you handle the edge case of an empty cluster (no points assigned)? The naive X[labels == j].mean() on an empty array gives NaN.
- Did you discuss convergence criteria? Most candidates only use max_iter; the senior signal is using both max_iter and a tolerance on centroid movement.
- Can you discuss the initialization sensitivity of K-means and mention K-means++? You don't have to implement K-means++, but knowing it exists separates a P5 from a P4.
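If the interviewer does ask what K-means++ would look like, a minimal sketch along these lines is usually enough. The helper name kmeans_pp_init and its seed parameter are illustrative assumptions, not part of the solution above. Each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=42):
    """Pick k initial centroids with K-means++ seeding (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(n_samples)]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid
        diffs = X[:, np.newaxis, :] - np.array(centroids)[np.newaxis, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        total = d2.sum()
        if total == 0:
            # Every point coincides with a chosen centroid; fall back to uniform
            probs = np.full(n_samples, 1.0 / n_samples)
        else:
            probs = d2 / total
        centroids.append(X[rng.choice(n_samples, p=probs)])
    return np.array(centroids)
```

To use it, you would replace the uniform rng.choice initialization in kmeans with a call to this helper. Points far from every existing centroid are more likely to be picked, which tends to spread the starting centroids across the data.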
Problem 2: Linear regression with gradient descent
Question: Implement linear regression using batch gradient descent. Given X (features) and y (targets), fit weights w that minimize mean squared error.
import numpy as np

def linear_regression(X, y, lr=0.01, n_iters=1000):
    n_samples, n_features = X.shape
    # Add bias column
    X_b = np.hstack([np.ones((n_samples, 1)), X])
    w = np.zeros(n_features + 1)
    for i in range(n_iters):
        predictions = X_b @ w
        errors = predictions - y
        gradient = (2 / n_samples) * X_b.T @ errors
        w -= lr * gradient
    return w  # [bias, w1, w2, ...]
The conversation that matters:
- "Why is the gradient (2/n) X^T (Xw - y)?" Derive it from the MSE loss (1/n) sum (Xw - y)^2. The factor of 2 comes from differentiating the squared term. Many candidates have memorized the formula without being able to derive it.
- "What if we wanted L2 regularization?" Add lambda * w to the gradient, which corresponds to a penalty of (lambda / 2) * ||w||^2, and leave the bias unregularized.
- "When would you use this vs. the closed-form normal equation?" Closed-form is faster for small datasets but has to solve (or invert) an n_features x n_features system, which costs roughly O(n_features^3); gradient descent scales better when n_features is large or when X^T X is singular.
- "When would you use stochastic gradient descent instead of batch?" When the dataset doesn't fit in memory, or when you want the stochastic noise to help escape local minima. The MSE loss for linear regression is convex, so it has no spurious local minima, but this matters for neural nets.
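To make the regularization and normal-equation follow-ups concrete, here is a minimal sketch. The function names ridge_gd and ridge_closed_form and the lam parameter are assumptions for illustration. It uses the penalty (lam / 2) * ||w||^2 on the non-bias weights, so the extra gradient term is lam * w. Setting the full gradient to zero gives the linear system solved in the closed-form version.

```python
import numpy as np

def ridge_gd(X, y, lam=0.1, lr=0.01, n_iters=5000):
    """Batch gradient descent on MSE + (lam / 2) * ||w[1:]||^2."""
    n_samples, n_features = X.shape
    X_b = np.hstack([np.ones((n_samples, 1)), X])
    w = np.zeros(n_features + 1)
    for _ in range(n_iters):
        errors = X_b @ w - y
        grad = (2 / n_samples) * X_b.T @ errors
        grad[1:] += lam * w[1:]  # L2 term; the bias (index 0) is not penalized
        w -= lr * grad
    return w

def ridge_closed_form(X, y, lam=0.1):
    """Solve (X^T X + (n * lam / 2) * D) w = X^T y, with D = identity except D[0, 0] = 0."""
    n_samples, n_features = X.shape
    X_b = np.hstack([np.ones((n_samples, 1)), X])
    D = np.eye(n_features + 1)
    D[0, 0] = 0.0  # do not regularize the bias
    A = X_b.T @ X_b + (n_samples * lam / 2) * D
    # np.linalg.solve avoids forming an explicit inverse
    return np.linalg.solve(A, X_b.T @ y)
```

On small, well-conditioned data the two should agree to several decimal places, which is a handy sanity check to mention out loud while coding.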
Problem 3: Collaborative filtering recommendation
Question: Given a sparse user-item rating matrix, implement matrix factorization with stochastic gradient descent to predict missing ratings.
import numpy as np

def matrix_factorize(R, mask, k=10, lr=0.01, n_epochs=20, reg=0.02):
    """R is the user-item matrix (sparse), mask is 1 where R is observed."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(42)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors
    for epoch in range(n_epochs):
        for u in range(n_users):
            for i in range(n_items):
                if mask[u, i]:
                    err = R[u, i] - P[u] @ Q[i]
                    # Cache the pre-update user vector so both updates use the same step
                    p_u = P[u].copy()
                    P[u] += lr * (err * Q[i] - reg * P[u])
                    Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q
This is a classic Atlassian ask because their Jira and Confluence products rely heavily on recommendations (related issues, suggested mentions, page recommendations). Showing you understand the math behind a recommender system is more valuable than knowing the Surprise library (scikit-surprise).
The follow-up discussion that matters:
- "Why initialize from a Gaussian, not zeros?" If both matrices start at zero, gradients are zero everywhere, and SGD never moves.
- "What's the role of the regularization term?" Prevents the factor magnitudes from blowing up, especially for users or items with few observations.
- "How would you cold-start a new user with no ratings?" Use content-based features as a fallback, or sample from the global distribution of P vectors.
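Once you have the factors, the interviewer often asks how you would score the result and what you would do for a new user. The sketch below is one hedged way to answer. The helper names are illustrative, and the cold-start fallback uses the mean user vector, which is a simpler stand-in for sampling from the distribution of P vectors.

```python
import numpy as np

def predict_ratings(P, Q):
    """Full predicted rating matrix: entry (u, i) is P[u] @ Q[i]."""
    return P @ Q.T

def masked_rmse(R, mask, P, Q):
    """RMSE over observed entries only (where mask is nonzero)."""
    observed = np.asarray(mask).astype(bool)
    diff = (R - predict_ratings(P, Q))[observed]
    return float(np.sqrt(np.mean(diff ** 2)))

def cold_start_scores(P, Q):
    """Score every item for a brand-new user via the mean user vector (an assumption)."""
    return Q @ P.mean(axis=0)
```

In practice you would compute masked_rmse on a held-out set of observed ratings rather than on the training entries, since the training error alone says little about generalization.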
Problem 4: Classical algorithm round - top K frequent words
Even in the ML loop, you get at least one classical algorithm problem. This is Atlassian's most-asked: top K frequent words.
Question: Given a list of strings and an integer k, return the k most frequent words. Sort by frequency descending; break ties lexicographically.
from collections import Counter
import heapq

def topKFrequent(words, k):
    count = Counter(words)
    # heapq is a min-heap, so (-freq, word) pops the most frequent word first
    # and breaks frequency ties lexicographically
    heap = [(-freq, word) for word, freq in count.items()]
    heapq.heapify(heap)
    # Guard against k exceeding the number of distinct words
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]
Atlassian-specific follow-up: "Now imagine these aren't all in memory - they arrive as a stream of millions per second. How does your solution change?" Expected pivot: Count-Min Sketch with a top-k heap, or Misra-Gries heavy hitters. Same conversation as the Netflix top-K problem; the streaming-systems familiarity is what they're checking.
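For the streaming pivot, a minimal Misra-Gries sketch looks roughly like this. The function name and signature are illustrative assumptions. It keeps at most k - 1 counters, any item occurring more than n/k times in a stream of length n is guaranteed to survive, and the stored counts are lower bounds rather than exact frequencies.

```python
def misra_gries(stream, k):
    """Track heavy-hitter candidates using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that hit zero.
            # The incoming item itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Because the counts are underestimates, an exact top-k ranking still needs a second pass over the data or a separate frequency estimator such as a Count-Min Sketch feeding a small heap.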
Problem 5: Model debugging round (the unusual one)
This isn't a coding-from-scratch problem. The interviewer shares a notebook or screen with a model that's underperforming and asks you to diagnose.
Scenario: "Here's a logistic regression model trying to predict whether a Jira ticket will be resolved within 7 days. Train accuracy is 95%, test accuracy is 65%. What do you do?"
The expected diagnostic walk:
- Check for overfitting. 95% train, 65% test - classic. Look at validation curves over training epochs.
- Check feature leakage. Are any features computed using post-resolution data (like "comment count after resolution")? This is the most common production cause.
- Check class imbalance. If 90% of tickets are unresolved at 7 days, a model that always predicts "no" would get 90% accuracy. Look at precision, recall, and confusion matrix instead.
- Inspect feature importances. If one feature has 80% of the weight and that feature is essentially the label, that's the leak.
- Cross-validate properly. Time-aware split, not random k-fold - random folds leak future info into training in time-series problems.
- If all of the above check out, regularize harder or simplify the model. A 65% test logistic regression might mean the features just aren't predictive enough; you need better features, not a better model.
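Two of the checks above are easy to sketch in plain NumPy: comparing accuracy against the majority-class baseline alongside precision and recall, and splitting by time instead of at random. The helper names and the test_fraction parameter below are assumptions for illustration, not part of any Atlassian notebook.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Accuracy, precision, recall, and the majority-class baseline accuracy."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    n = tp + fp + fn + tn
    pos_rate = (tp + fn) / n
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Accuracy of always predicting the more common class
        "majority_baseline": max(pos_rate, 1 - pos_rate),
    }

def time_aware_split(timestamps, test_fraction=0.2):
    """Return (train_idx, test_idx) so no test row is earlier than any train row."""
    order = np.argsort(np.asarray(timestamps), kind="stable")
    n_test = max(1, int(round(len(order) * test_fraction)))
    cut = len(order) - n_test
    return order[:cut], order[cut:]
```

If accuracy is no better than majority_baseline, or recall on the positive class is near zero, the headline accuracy number is telling you about the class balance rather than the model.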
What separates levels here:
- P4: Names the overfitting hypothesis correctly, suggests cross-validation. Hire.
- P5: Walks the full diagnostic checklist, finds the leakage, suggests how to redesign the feature pipeline to prevent it. Hire-strong.
- P6: Same as P5, plus discusses the productionization implications - monitoring for distribution drift, A/B testing the fix, owning the feedback loop. Hire-strong, possible level-up.
The Values round
Atlassian's five values, with the question patterns each tends to surface:
| Value | Typical question | What they grade |
|---|---|---|
| Open Company, No Bullshit | "Tell me about a time you delivered hard feedback or pushed back on a senior decision." | Directness; willingness to be uncomfortable |
| Build with Heart and Balance | "Tell me about a project where you had to choose between shipping and quality." | Long-term thinking; not burning out the team |
| Don't #@!% the Customer | "Tell me about a customer-facing mistake you made and how you handled it." | Customer empathy; transparency |
| Play as a Team | "Tell me about a conflict with a peer and how you resolved it." | Collaboration without ego; conflict navigation |
| Be the Change You Seek | "Tell me about something you improved that wasn't your responsibility." | Initiative; agency over scope |
Prepare one specific STAR story per value. They will probe: "what was your specific role," "what would you do differently," "how did the other people involved react." Vague stories get downgraded.
Common pitfalls in the ML rounds
- Reaching for scikit-learn in the from-scratch round. The interviewer will stop you. Practice writing K-means, linear regression, and gradient descent from a blank file. Don't rely on libraries.
- Skipping the data inspection step in the debugging round. Always ask "can I see the data distribution" before suggesting model changes. Atlassian wants ML engineers who debug from data, not from models.
- Not knowing the products. Atlassian makes Jira, Confluence, Bitbucket, Trello, and Atlassian Intelligence. The hiring manager round will ask if you've used them and how. Have specific feedback ready - even constructive criticism is welcome (it's literally the Open Company value).
- Treating the Values round as a formality. A "fine" performance in the Values round combined with strong technical signal can still be a no-hire. Take it as seriously as the coding rounds.
- Forgetting Atlassian is Sydney-based. Time zones matter for remote roles. If you're US-based and applying for a US team, expect early-morning syncs with Sydney engineers. Acknowledge this in the hiring manager round.
What P4, P5, and P6 actually look like in the same problem
Consider the linear regression from-scratch problem. How responses differ by level:
- P4 (mid): Implements correctly, can explain the gradient. May not immediately see L2 regularization without prompting. Hire if signal is consistent.
- P5 (senior): Implements correctly, derives the gradient from loss, discusses regularization, batch vs SGD trade-offs, when to use this vs. normal equation. Hire-strong.
- P6 (staff): All of P5, plus discusses productionization: how would this fit in a feature store pipeline, what monitoring would you add, when would you retrain. The coding is the same; the conversation around it doubles in scope. Hire-strong, considered for staff.
A 14-day prep plan for an Atlassian ML loop
- Days 1-2: Drill from-scratch ML: implement K-means, linear regression with gradient descent, logistic regression, a basic decision tree, and matrix factorization. NumPy only.
- Days 3-4: Classical algorithms. Top 25 problems from our LeetCode patterns post, with extra weight on heap, hashmap, and graph problems.
- Days 5-7: ML system design. Pick 3 systems and design them end-to-end: a recommendation system, a fraud detection system, an embedding service. Cover data, training, serving, monitoring.
- Days 8-9: Model debugging practice. Take public Kaggle notebooks with poor performance and write a debugging report for each.
- Days 10-11: Behavioral / Values prep. Write one specific STAR story per Atlassian value. Practice telling each in under 4 minutes.
- Days 12-13: Mock interview each round type once. Use our solo mock interview techniques if you don't have a partner.
- Day 14: Rest. Reread one favorite ML paper to keep your conversational gear loose. Don't drill.
FAQ
What does the Atlassian ML coding interview look like?
Three to five technical rounds depending on level. The coding rounds blend classical algorithm problems (one or two LeetCode mediums) with from-scratch ML implementations (K-means, linear regression with gradient descent, simple decision tree). At least one round is dedicated to ML system design or ML model debugging. There is also a Values round that all Atlassian engineers go through.
Do I need to implement ML algorithms from scratch?
Yes, expect at least one from-scratch implementation. Common asks: implement K-means clustering, linear regression with gradient descent, a basic neural network forward pass, or a recommendation algorithm using collaborative filtering. You can use NumPy but not scikit-learn, PyTorch, or TensorFlow. The point is to test whether you understand the math behind the libraries you use daily.
What are Atlassian's company values and why do they matter?
Five values: Open Company, No Bullshit; Build with Heart and Balance; Don't #@!% the Customer; Play as a Team; Be the Change You Seek. The Values round explicitly grades fit against these, particularly the first and fourth - directness and collaboration. They are not a formality; engineers with strong technical signal have been declined for weak values alignment.
What is the difference between Atlassian P4, P5, and P6?
P4 is a mid-level engineer (3-5 years of experience). P5 is senior: owns features end-to-end and mentors P4s. P6 is staff: drives multi-team initiatives and owns architectural decisions. For ML engineers, P5 expects you to ship a model to production end-to-end (data pipeline, training, deployment, monitoring); P6 expects you to design the ML platform other teams use. The coding bar is similar across P4 and P5; the system design bar diverges sharply.
Is the Atlassian interview remote or in-person?
Atlassian was one of the first major tech companies to commit to permanent fully-remote work (their Team Anywhere policy). Interviews are conducted by video call (Zoom or Google Meet) unless you explicitly request in-office at their Sydney, San Francisco, Austin, or Bengaluru offices. The interview format is identical either way.