Atlassian ML Coding Interview: What to Expect

Classical algorithms plus from-scratch ML. Five representative problems, the Values round nobody preps for, and the P5 vs P6 split that decides your level.

If you've been preparing for a Google or Meta ML engineering interview, the Atlassian loop will feel both more practical and more idiosyncratic. The classical algorithm round is similar - LeetCode mediums, talk through your reasoning, write clean code. But Atlassian will also ask you to implement an ML algorithm from scratch in NumPy, debug an underperforming model with you as the only engineer in the room, and pass a dedicated Values round that has cost technically-strong candidates the offer.

This guide walks through five representative problems from the Atlassian ML engineer loop, how the ML system design round works, and the level differentiation between P4, P5, and P6 for ML roles.

How Atlassian's ML loop differs from the rest

Three notable structural choices:

1. Implement-from-scratch is mandatory. At Google or Meta you can typically pass an ML interview using pseudo-code and discussing your approach. Atlassian requires at least one round where you write working NumPy code from a blank file. K-means, linear regression with gradient descent, or a recommendation algorithm using collaborative filtering are the three most common asks. No scikit-learn, no PyTorch.

2. The Values round is real. Atlassian publishes five company values, and one round of your loop is dedicated to grading you against them. The most-asked pair is "Open Company, No Bullshit" and "Play as a Team": they want to see directness combined with collaboration. Candidates who pass on technical signal but read as guarded or self-aggrandizing in the Values round get declined.

3. Remote by default. Atlassian was one of the first major tech companies to commit to permanent remote work (its Team Anywhere policy). Interviews are conducted on Zoom or Google Meet. There's no in-office pressure, but it also means the interviewer is judging your remote-work signal: how you present on video, how you handle a screen-share, how you write code in a shared editor.

The interview process for ML engineering roles

Stage | Format | Focus
Recruiter screen | 30 min | Background, target team, comp expectations
Hiring manager screen | 45 min | Recent ML projects, team fit. Often includes one easy ML conceptual question
Technical phone screen | 60 min | 1 LeetCode-style problem + 1 ML conceptual / from-scratch coding
Onsite Round 1: Coding | 60 min | 2 problems: 1 classical algorithm + 1 ML implementation from scratch
Onsite Round 2: ML system design | 60 min | Design an ML system end-to-end (data, training, serving, monitoring)
Onsite Round 3: Model debugging / case study | 60 min | Given a poorly-performing model, diagnose and propose fixes
Values round | 45 min | Behavioral against the 5 Atlassian values
Hiring manager wrap | 30 min | Q&A, sell mode

Problem 1: Implement K-means clustering from scratch

Question: Given an array of 2D points and an integer k, implement K-means clustering. Return the final cluster assignments and centroid positions. Use NumPy only - no scikit-learn.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4):
    n_samples = X.shape[0]
    rng = np.random.default_rng(42)
    indices = rng.choice(n_samples, size=k, replace=False)
    centroids = X[indices].copy()

    for iteration in range(max_iter):
        # Step 1: assign each point to nearest centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)

        # Step 2: recompute centroids
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

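        # Convergence check: measure how far the centroids moved, adopt the
        # new centroids, then stop if the movement is below tol
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break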

    return labels, centroids
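
The assignment step is the line interviewers most often ask you to explain, because it leans on NumPy broadcasting. Below is a minimal sketch of that one step on a hand-made example; the three points and two centroids are made up purely for illustration.

```python
import numpy as np

# Three 2D points and two centroids (illustrative values, not from any dataset)
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# X[:, np.newaxis] has shape (3, 1, 2); centroids has shape (2, 2).
# Broadcasting produces every point-minus-centroid difference: shape (3, 2, 2).
diff = X[:, np.newaxis] - centroids

# Taking the norm over the last axis collapses the coordinates,
# leaving one distance per (point, centroid) pair: shape (3, 2).
distances = np.linalg.norm(diff, axis=2)

# Each point is assigned to the index of its nearest centroid.
labels = np.argmin(distances, axis=1)
print(diff.shape, distances.shape, labels)
```

Being able to state the intermediate shapes out loud is usually enough to show the vectorization is deliberate rather than memorized.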

What Atlassian grades:

Problem 2: Linear regression with gradient descent

Question: Implement linear regression using batch gradient descent. Given X (features) and y (targets), fit weights w that minimize mean squared error.

import numpy as np

def linear_regression(X, y, lr=0.01, n_iters=1000):
    n_samples, n_features = X.shape
    # Add bias column
    X_b = np.hstack([np.ones((n_samples, 1)), X])
    w = np.zeros(n_features + 1)

    for i in range(n_iters):
        predictions = X_b @ w
        errors = predictions - y
        gradient = (2 / n_samples) * X_b.T @ errors
        w -= lr * gradient

    return w  # [bias, w1, w2, ...]
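
One quick way to convince yourself (and the interviewer) that the gradient is right is to compare the result against a closed-form least-squares solve. The sketch below restates the same update rule compactly under the hypothetical name `fit_gd`, runs it on synthetic data that is assumed only for illustration, and checks it against `np.linalg.lstsq`.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, n_iters=500):
    """Batch gradient descent on MSE; returns [bias, w1, w2, ...]."""
    n_samples = X.shape[0]
    X_b = np.hstack([np.ones((n_samples, 1)), X])  # prepend the bias column
    w = np.zeros(X_b.shape[1])
    for _ in range(n_iters):
        # Gradient of mean squared error with respect to w
        gradient = (2 / n_samples) * X_b.T @ (X_b @ w - y)
        w -= lr * gradient
    return w

# Synthetic data (an assumption for this check): y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)

w_gd = fit_gd(X, y)

# Closed-form least-squares reference on the same bias-augmented matrix
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
w_ls, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Largest absolute difference between the two weight vectors
print(np.max(np.abs(w_gd - w_ls)))
```

The learning rate here is larger than the default above because the synthetic features are already on a unit scale; on unscaled features the same rate may diverge, which is itself a useful talking point.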

The conversation that matters:

Problem 3: Collaborative filtering recommendation

Question: Given a sparse user-item rating matrix, implement matrix factorization with stochastic gradient descent to predict missing ratings.

import numpy as np

def matrix_factorize(R, mask, k=10, lr=0.01, n_epochs=20, reg=0.02):
    """R is the user-item rating matrix (dense array); mask is 1 where R is observed."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(42)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors
    observed = np.argwhere(mask)  # (u, i) index pairs of known ratings

    for epoch in range(n_epochs):
        rng.shuffle(observed)  # visit ratings in a fresh random order each epoch
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_u = P[u].copy()  # keep the pre-update user vector for the item step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

This is a classic Atlassian ask because their Jira and Confluence products rely heavily on recommendations (related issues, suggested mentions, page recommendations). Showing you understand the math behind a recommender system is more valuable than knowing the scikit-surprise library.
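
Once you have the two factor matrices, the natural follow-up is "how do you predict and how do you know it's working?" A minimal sketch, assuming user factors `P` and item factors `Q` shaped as in the function above; the helper names `predict_all` and `observed_rmse` and the tiny matrices are hypothetical.

```python
import numpy as np

def predict_all(P, Q):
    """Predicted rating for every (user, item) pair: dot product of their factors."""
    return P @ Q.T

def observed_rmse(R, mask, P, Q):
    """Root mean squared error over observed ratings only."""
    observed = np.asarray(mask).astype(bool)
    errors = (R - predict_all(P, Q))[observed]
    return float(np.sqrt(np.mean(errors ** 2)))

# Hand-made factors and ratings (illustrative values only)
P = np.array([[1.0, 0.0], [0.0, 1.0]])              # 2 users, 2 factors
Q = np.array([[4.0, 1.0], [2.0, 5.0], [3.0, 3.0]])  # 3 items, 2 factors
R = np.array([[4.0, 0.0, 4.0], [0.0, 5.0, 0.0]])
mask = np.array([[1, 0, 1], [0, 1, 0]])

predictions = predict_all(P, Q)
rmse = observed_rmse(R, mask, P, Q)
print(predictions)
print(rmse)
```

In practice you would compute this on a held-out set of observed ratings, not on the ratings used for training, so the number reflects generalization and not memorization.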

The discussion that turns the page:

Problem 4: Classical algorithm round - top K frequent words

Even in the ML loop, you get at least one classical algorithm problem. This is Atlassian's most-asked: top K frequent words.

Question: Given a list of strings and an integer k, return the k most frequent words. Sort by frequency descending; break ties lexicographically.

from collections import Counter
import heapq

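def topKFrequent(words, k):
    count = Counter(words)
    # heapq is a min-heap, so store (-freq, word): the smallest tuple is the most
    # frequent word, and equal frequencies fall back to lexicographic order.
    heap = [(-freq, word) for word, freq in count.items()]
    heapq.heapify(heap)
    # Guard against k exceeding the number of distinct words
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]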

Atlassian-specific follow-up: "Now imagine these aren't all in memory - they arrive as a stream of millions per second. How does your solution change?" Expected pivot: Count-Min Sketch with a top-k heap, or Misra-Gries heavy hitters. Same conversation as the Netflix top-K problem; the streaming-systems familiarity is what they're checking.
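
For the streaming pivot, it helps to be able to write one of these down. Here is a minimal sketch of Misra-Gries heavy hitters; the function name and the `k` parameter convention (keep at most k - 1 counters) are choices made for this example. Under that convention, any word that appears more than n/k times in a stream of n words is guaranteed to survive, though the stored counts are underestimates.

```python
def misra_gries(stream, k):
    """One-pass heavy-hitter candidates using at most k - 1 counters (k >= 2)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Illustrative stream of 11 words; "a" appears 6 times, more than 11 / 3
stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
candidates = misra_gries(stream, k=3)
print(candidates)
```

The survivors are candidates, not a final answer: a word can survive without being frequent, so exact counts need a second pass over the data (or a Count-Min Sketch lookup) if the stream can be replayed.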

Problem 5: Model debugging round (the unusual one)

This isn't a coding-from-scratch problem. The interviewer shares a notebook or screen with a model that's underperforming and asks you to diagnose.

Scenario: "Here's a logistic regression model trying to predict whether a Jira ticket will be resolved within 7 days. Train accuracy is 95%, test accuracy is 65%. What do you do?"

The expected diagnostic walk:

  1. Check for overfitting. 95% train, 65% test - classic. Look at validation curves over training epochs.
  2. Check feature leakage. Are any features computed using post-resolution data (like "comment count after resolution")? This is the most common production cause.
  3. Check class imbalance. If 90% of tickets are unresolved at 7 days, a model that always predicts "no" would get 90% accuracy. Look at precision, recall, and confusion matrix instead.
  4. Inspect feature importances. If one feature has 80% of the weight and that feature is essentially the label, that's the leak.
  5. Cross-validate properly. Time-aware split, not random k-fold - random folds leak future info into training in time-series problems.
  6. If all of the above check out, regularize harder or simplify the model. A 65% test logistic regression might mean the features just aren't predictive enough; you need better features, not a better model.
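
Steps 3 and 5 of the walk above can be sketched with a few NumPy helpers. The function names and the toy arrays are hypothetical, and this assumes binary labels with 1 as the positive class; it is a starting point for the conversation, not a full evaluation harness.

```python
import numpy as np

def time_aware_split(timestamps, test_fraction=0.2):
    """Return (train_idx, test_idx) so that no test row is earlier than any train row."""
    order = np.argsort(timestamps, kind="stable")
    n_test = max(1, int(round(len(order) * test_fraction)))
    return order[:-n_test], order[-n_test:]

def majority_baseline_accuracy(y):
    """Accuracy of a model that always predicts the most common class."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts.max() / counts.sum()

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class; 0.0 when undefined."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Toy data: ticket creation times (arbitrary units) and labels
timestamps = np.array([5, 1, 4, 2, 3])
train_idx, test_idx = time_aware_split(timestamps, test_fraction=0.4)

baseline = majority_baseline_accuracy([0] * 9 + [1])
precision, recall = precision_recall([1, 0, 1, 0], [1, 1, 0, 0])
print(train_idx, test_idx, baseline, precision, recall)
```

If the model's test accuracy is not clearly above the majority baseline, the accuracy number is telling you about the class balance, not about the model.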

What separates levels here:

The Values round

Atlassian's five values, with the question patterns each tends to surface:

Value | Typical question | What they grade
Open Company, No Bullshit | "Tell me about a time you delivered hard feedback or pushed back on a senior decision." | Directness; willingness to be uncomfortable
Build with Heart and Balance | "Tell me about a project where you had to choose between shipping and quality." | Long-term thinking; not burning out the team
Don't BS the Customer | "Tell me about a customer-facing mistake you made and how you handled it." | Customer empathy; transparency
Play as a Team | "Tell me about a conflict with a peer and how you resolved it." | Collaboration without ego; conflict navigation
Be the Change You Seek | "Tell me about something you improved that wasn't your responsibility." | Initiative; agency over scope

Prepare one specific STAR story per value. They will probe: "what was your specific role," "what would you do differently," "how did the other people involved react." Vague stories get downgraded.

Common pitfalls in the ML rounds

What P4, P5, and P6 actually look like in the same problem

Consider the linear regression from-scratch problem. How responses differ by level:

A 14-day prep plan for an Atlassian ML loop

  1. Days 1-2: Drill from-scratch ML: implement K-means, linear regression with gradient descent, logistic regression, a basic decision tree, and matrix factorization. NumPy only.
  2. Days 3-4: Classical algorithms. Top 25 problems from our LeetCode patterns post, with extra weight on heap, hashmap, and graph problems.
  3. Days 5-7: ML system design. Pick 3 systems and design them end-to-end: a recommendation system, a fraud detection system, an embedding service. Cover data, training, serving, monitoring.
  4. Days 8-9: Model debugging practice. Take public Kaggle notebooks with poor performance and write a debugging report for each.
  5. Days 10-11: Behavioral / Values prep. Write one specific STAR story per Atlassian value. Practice telling each in under 4 minutes.
  6. Days 12-13: Mock interview each round type once. Use our solo mock interview techniques if you don't have a partner.
  7. Day 14: Rest. Reread one favorite ML paper to keep your conversational gear loose. Don't drill.

FAQ

What does the Atlassian ML coding interview look like?

Three to five technical rounds depending on level. The coding rounds blend classical algorithm problems (one or two LeetCode mediums) with from-scratch ML implementations (k-means, linear regression with gradient descent, simple decision tree). At least one round is dedicated to ML system design or ML model debugging. Plus a Values round that all Atlassian engineers go through.

Do I need to implement ML algorithms from scratch?

Yes, expect at least one from-scratch implementation. Common asks: implement k-means clustering, linear regression with gradient descent, a basic neural network forward pass, or a recommendation algorithm using collaborative filtering. You will use NumPy at most; you cannot use scikit-learn, PyTorch, or TensorFlow. The point is testing whether you understand the math behind the libraries you use daily.

What are Atlassian's company values and why do they matter?

Five values: "Open Company, No Bullshit"; "Build with Heart and Balance"; "Don't BS the Customer"; "Play as a Team"; "Be the Change You Seek". The Values round explicitly grades fit against these, particularly the first and fourth (directness and collaboration). They are not a formality; engineers with strong technical signal have been declined for weak values alignment.

What is the difference between Atlassian P4, P5, and P6?

P4 is mid-level engineer (3-5 years experience). P5 is senior, owns features end-to-end, mentors P4s. P6 is staff, drives multi-team initiatives, owns architectural decisions. For ML engineers, P5 expects you to ship a model to production end-to-end (data pipeline, training, deployment, monitoring); P6 expects you to design the ML platform other teams use. Coding bar is similar across P4/P5; system design bar diverges sharply.

Is the Atlassian interview remote or in-person?

Atlassian was one of the first major tech companies to commit to permanent remote work (its Team Anywhere policy). Interviews are conducted over video (Zoom or Google Meet) unless you explicitly request in-office at their Sydney, San Francisco, Austin, or Bengaluru offices. The interview format is identical either way.