Asking someone to rank 15 items is a bad survey question. The data is valuable, sure. The task is just painful. People give up, rush through it, or pick whatever order gets them to the next page.

We knew we needed ranking in our survey builder. The question was how to make it work without making respondents hate you.

The Problem With Traditional Ranking

Most survey tools give you a drag-and-drop list. Reorder these 12 items from best to worst. It works fine for 4 or 5 things. Beyond that, you’re asking people to hold too many comparisons in their head at once.

Research backs this up. A respondent can meaningfully differentiate between about 5-7 items in a single ranking task. Ask for more and the middle positions degrade: the respondent knows their top 2 and bottom 2, but positions 5 through 9 are close to random.

We wanted to support lists of any size, from 3 options to 50+, without degrading data quality. That meant we needed multiple algorithms, each suited to a different scale.

Four Algorithms, One Output

Every ranking method produces the same output: a score from 0 to 100 for each item. The top-ranked item gets 100, the bottom gets 0, everything else falls between. This makes the results comparable regardless of which algorithm collected them.

Here’s what we built and why.

Drag & Drop (Sort): The Obvious Starting Point

For lists of 5 or fewer items, drag-and-drop works well. The cognitive load is low, the interaction is familiar, and the results are unambiguous. You can see all items at once and physically arrange them.

We convert the final order to scores using linear interpolation. First place gets 100, last gets 0, and everything in between is evenly spaced:

score = ((total - 1 - position) / (total - 1)) * 100

For 5 items, that gives you 100, 75, 50, 25, 0.
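As a minimal sketch, the interpolation can be written as a small function. The name scoresFromOrder and the single-item fallback are illustrative assumptions, not part of the formula itself.

```javascript
// Convert a final drag-and-drop order into 0-100 scores.
// `orderedItems` is assumed to be the list as the respondent left it, best first.
function scoresFromOrder(orderedItems) {
  const total = orderedItems.length;
  return orderedItems.map((item, position) => ({
    item,
    // Assumption: a single-item list has no spread, so it simply gets 100.
    score: total > 1 ? ((total - 1 - position) / (total - 1)) * 100 : 100,
  }));
}

const result = scoresFromOrder(["A", "B", "C", "D", "E"]);
console.log(result.map((r) => r.score)); // [100, 75, 50, 25, 0]
```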

We shuffle the initial order on each session to prevent position bias. Without shuffling, people tend to leave the first items near the top, which corrupts data when you’re averaging across respondents.

The shuffling itself matters more than you’d think. The common array.sort(() => Math.random() - 0.5) is not a uniform shuffle. It produces biased distributions that vary across JavaScript engines. We use a proper approach: assign each item a random seed, then sort by seed. For algorithms that also prioritize by a metric (like “least shown”), the seed serves as a tie-breaker.

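// sortBy returns a numeric priority (lower comes first). The default treats
// every item as equal, which gives a plain shuffle. The name shuffleBy is illustrative.
function shuffleBy(array, sortBy = () => 0) {
    return array
        .map((item) => ({ item, rank: sortBy(item), seed: Math.random() }))
        .sort((a, b) => (a.rank !== b.rank ? a.rank - b.rank : a.seed - b.seed))
        .map((w) => w.item);
}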

This guarantees uniform randomization when ranks are equal, and stable priority ordering when they differ. We use this same function across all four algorithms.

Pairwise Comparison: Borrowing From Chess

When the list grows beyond 5-6 items, drag-and-drop breaks down. But comparing two items is trivial. “Which of these two do you prefer?” takes almost no effort. That’s the core of pairwise comparison.

The challenge is scoring. If item A beats item B, and item B beats item C, does that mean A should rank above C? Usually, but not always, and the margin matters.

Simple “win percentage” fails here. If A wins all three of its matches against weak opponents and C wins two of three against very strong ones, win percentage says A is better, even though C faced much tougher competition.

Enter Elo

The Elo rating system, originally designed for chess rankings, handles this well. Every item starts at a rating of 1000. When two items are compared, the winner gains points and the loser loses points, but the amount depends on the expected outcome.

If a highly-rated item beats a lowly-rated one, neither rating changes much. That outcome was expected. If the underdog wins, both ratings shift significantly.

The math:

Expected win probability = 1 / (1 + 10^((opponent_rating - my_rating) / 400))
Rating change = K * (actual - expected)

Where K is the volatility factor. We use K=32, a value commonly used in chess rating systems. Higher K means ratings shift faster but are less stable. Lower K means more matches are needed before ratings settle.

After all comparisons, we normalize the Elo ratings to a 0-100 scale so the output matches the other algorithms.
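The update and normalization steps can be sketched in a few lines. This assumes min-max normalization (top rating maps to 100, bottom to 0) and a fallback of 50 when every rating is identical. Both choices, and the function names, are assumptions for illustration.

```javascript
const K = 32;
const START_RATING = 1000;

// Probability that `mine` beats `opponent`, per the formula above.
function expectedScore(mine, opponent) {
  return 1 / (1 + 10 ** ((opponent - mine) / 400));
}

// Apply one comparison result to a ratings object (mutates and returns it).
// The loser's change mirrors the winner's because the two expectations sum to 1.
function recordResult(ratings, winner, loser) {
  const expectedWin = expectedScore(ratings[winner], ratings[loser]);
  const delta = K * (1 - expectedWin);
  ratings[winner] += delta;
  ratings[loser] -= delta;
  return ratings;
}

// Min-max normalize ratings to 0-100.
// Assumption: if all ratings are equal, every item gets 50.
function normalizeRatings(ratings) {
  const values = Object.values(ratings);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const out = {};
  for (const [id, rating] of Object.entries(ratings)) {
    out[id] = max === min ? 50 : ((rating - min) / (max - min)) * 100;
  }
  return out;
}

const ratings = { A: START_RATING, B: START_RATING, C: START_RATING };
recordResult(ratings, "A", "B"); // equal ratings: A gains 16, B loses 16
recordResult(ratings, "B", "C"); // B is now the slight underdog, so it gains a bit more than 16
console.log(normalizeRatings(ratings));
```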

How Many Comparisons?

This is the core tradeoff. More comparisons mean more accurate data but more respondent fatigue. Fewer comparisons risk unreliable rankings.

In full mode (items <= 5), we show every possible pair. For 5 items, that’s 10 comparisons, which takes about a minute.

In partial mode (items > 5), we stop when each item has been seen a target number of times. The defaults:

List size	Target views per item	Estimated comparisons
5	4	10 (all pairs)
6-20	3	~9-30
20+	2	~20-25

We also never repeat a pair. Once you’ve seen A vs B, that matchup won’t appear again. To enforce this, we track seen pairs with an order-independent key, so A:B and B:A count as the same pair.
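One simple way to build such a key is to put the two ids in a fixed order before joining them. This is a sketch that assumes item ids are strings and never contain a colon; pairKey is an illustrative name.

```javascript
// Build a key that is the same regardless of argument order.
// Assumes ids are strings that never contain ":".
function pairKey(a, b) {
  return a < b ? `${a}:${b}` : `${b}:${a}`;
}

const seenPairs = new Set();
seenPairs.add(pairKey("B", "A"));
console.log(seenPairs.has(pairKey("A", "B"))); // true
```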

Smart Pair Selection

Not all comparisons are equally useful. Comparing two items that have already been compared many times adds little information. Comparing two items with few data points is much more valuable.

Our pair selection algorithm prioritizes items with the fewest matches. It orders candidates by match count (fewest first, with random tie-breaking), then finds the first pair that hasn’t been seen yet.

This means early comparisons spread coverage evenly across items, and later comparisons fill in gaps. The result is that even with partial coverage, every item has been evaluated enough times for a reasonable rating.
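Here is a rough sketch of that selection loop. The exact scan order for “first unseen pair” is an assumption, as are the names nextPair, matchCounts, and seenPairs. The injectable random parameter exists only to make the sketch testable.

```javascript
function pairKey(a, b) {
  return a < b ? `${a}:${b}` : `${b}:${a}`;
}

// matchCounts: Map of item id -> comparisons so far. seenPairs: Set of pair keys.
// Returns [a, b], or null once every possible pair has been shown.
function nextPair(items, matchCounts, seenPairs, random = Math.random) {
  const candidates = items
    .map((item) => ({ item, count: matchCounts.get(item) ?? 0, seed: random() }))
    .sort((x, y) => (x.count !== y.count ? x.count - y.count : x.seed - y.seed))
    .map((w) => w.item);

  // Assumed scan order: walk the least-seen items first and take the first unseen pair.
  for (let i = 0; i < candidates.length; i++) {
    for (let j = i + 1; j < candidates.length; j++) {
      if (!seenPairs.has(pairKey(candidates[i], candidates[j]))) {
        return [candidates[i], candidates[j]];
      }
    }
  }
  return null;
}

// Partial mode: stop once every item has reached the target number of views.
const items = ["A", "B", "C", "D", "E", "F"];
const matchCounts = new Map();
const seenPairs = new Set();
const targetViews = 3;
let shown = 0;
while (items.some((id) => (matchCounts.get(id) ?? 0) < targetViews)) {
  const pair = nextPair(items, matchCounts, seenPairs);
  if (!pair) break; // all pairs exhausted
  seenPairs.add(pairKey(pair[0], pair[1]));
  for (const id of pair) matchCounts.set(id, (matchCounts.get(id) ?? 0) + 1);
  shown++;
}
console.log(shown, Object.fromEntries(matchCounts));
```

Because ties are broken randomly, the exact number of comparisons varies a little from run to run.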

MaxDiff: The Research-Grade Method

MaxDiff (Maximum Difference Scaling) comes from market research. Instead of comparing two items, you show a subset (typically 4-5) and ask: “Which is best? Which is worst?”

Each screen produces two data points: one positive signal and one negative signal. This makes it more efficient than pairwise per interaction, and the best/worst format anchors the scale at both ends, which reduces the “everything is above average” bias you get with rating scales.

Scoring

The raw score is straightforward: (best_count - worst_count) / times_shown. An item picked as best every time it appeared scores +1. An item picked as worst every time scores -1. We multiply by 100 and normalize to a 0-100 range for consistency.
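A sketch of the scoring step, assuming the final normalization is min-max (so the best item lands on 100 and the worst on 0, matching the other methods). A fixed mapping from [-100, 100] to [0, 100] would be another reasonable reading. The tallies shape and the function name are hypothetical.

```javascript
// tallies: { [itemId]: { best, worst, shown } }
function maxDiffScores(tallies) {
  const scaled = {};
  for (const [id, t] of Object.entries(tallies)) {
    // Assumption: an item that was never shown gets a neutral raw score of 0.
    scaled[id] = t.shown > 0 ? ((t.best - t.worst) / t.shown) * 100 : 0;
  }
  const values = Object.values(scaled);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const scores = {};
  for (const [id, value] of Object.entries(scaled)) {
    // Assumption: with no spread at all, every item gets 50.
    scores[id] = max === min ? 50 : ((value - min) / (max - min)) * 100;
  }
  return scores;
}

console.log(
  maxDiffScores({
    A: { best: 3, worst: 0, shown: 3 }, // raw +1
    B: { best: 0, worst: 3, shown: 3 }, // raw -1
    C: { best: 1, worst: 1, shown: 3 }, // raw 0
  })
); // { A: 100, B: 0, C: 50 }
```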

Screen Generation

The key challenge is which items to show on each screen. Show the same items together repeatedly and you get biased comparisons. Show items unevenly and some have more data than others.

We use a balanced exposure algorithm. Each time we generate a screen, we sort items by how many times they’ve been shown (fewest first) and pick the top N. This guarantees every item gets roughly equal exposure.
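The balanced exposure idea can be sketched as follows. Breaking ties randomly among equally-shown items is an assumption here (it also keeps the same items from always appearing together), and nextScreen and shownCounts are illustrative names.

```javascript
// shownCounts: Map of item id -> number of screens the item has appeared on.
// Picks the least-shown items, breaking ties randomly, and records the exposure.
function nextScreen(items, shownCounts, itemsPerScreen, random = Math.random) {
  const screen = items
    .map((item) => ({ item, count: shownCounts.get(item) ?? 0, seed: random() }))
    .sort((a, b) => (a.count !== b.count ? a.count - b.count : a.seed - b.seed))
    .slice(0, itemsPerScreen)
    .map((w) => w.item);
  for (const item of screen) {
    shownCounts.set(item, (shownCounts.get(item) ?? 0) + 1);
  }
  return screen;
}

// 15 items, 5 per screen, 9 screens: every item ends up shown exactly 3 times.
const items = Array.from({ length: 15 }, (_, i) => `item-${i + 1}`);
const shownCounts = new Map();
const screens = [];
for (let s = 0; s < 9; s++) {
  screens.push(nextScreen(items, shownCounts, 5));
}
console.log([...shownCounts.values()]);
```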

Fatigue Management

The number of screens grows with list size. The formula: (items * target_views) / items_per_screen. For 15 items with 3 views each and 5 items per screen, that’s 9 screens. Manageable.

But at 50 items with 3 views and 5 per screen, you get 30 screens. That’s not a survey question; that’s a chore. So we automatically reduce target views to 2 when the estimated screen count exceeds 20. It’s a tradeoff: slightly less statistical power in exchange for a survey people actually finish.

List size	Items per screen	Target views	Total screens
4	4	3	3
10	5	3	6
15	5	3	9
30	5	3	18
50	5	2	20
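The screen-count rule is easy to express in code. This sketch assumes fractional results are rounded up and that a list shorter than the screen size shows all of its items on each screen. The constant and function names are illustrative.

```javascript
const MAX_SCREENS = 20;

// Estimate how many MaxDiff screens a list needs, dropping to 2 views per item
// when the default would push the screen count past MAX_SCREENS.
function planScreens(itemCount, itemsPerScreen = 5, targetViews = 3) {
  // Assumption: a screen never shows more items than the list contains.
  const perScreen = Math.min(itemsPerScreen, itemCount);
  // Assumption: round up so every item reaches its target view count.
  const estimate = (views) => Math.ceil((itemCount * views) / perScreen);
  let views = targetViews;
  if (estimate(views) > MAX_SCREENS) views = 2;
  return { itemsPerScreen: perScreen, targetViews: views, screens: estimate(views) };
}

console.log(planScreens(15)); // { itemsPerScreen: 5, targetViews: 3, screens: 9 }
console.log(planScreens(50)); // { itemsPerScreen: 5, targetViews: 2, screens: 20 }
```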

Why Not Just Pairwise for Everything?

Pairwise is simpler to explain and faster per interaction. But for large lists, the number of possible pairs grows quadratically (n*(n-1)/2). For 15 items, that’s 105 pairs. Even with partial coverage, pairwise struggles to get reliable data in a reasonable number of questions.

MaxDiff scales better because each screen evaluates multiple items simultaneously. 9 screens of 5 items each gives you 45 item-level data points for 15 items. A pairwise comparison exposes only two items at a time, so matching that coverage would take roughly 23 comparisons (45 / 2).

Budget Allocation: When Trade-offs Are the Point

Budget works differently from the other three. You get 100 points and distribute them across items. Giving 40 points to Feature A and 10 to Feature B makes the relative importance explicit and intuitive.

This method is best when you want respondents to think about resources and trade-offs. “If you had a budget, where would you spend it?”

The implementation is straightforward: plus/minus buttons that increment by 5, a remaining-points counter that goes from green to yellow to red, and input validation that prevents overspending.

There’s no algorithm here in the mathematical sense. The scores are whatever the respondent assigns. We just normalize them so the highest allocation maps to 100 and the lowest to 0, keeping output consistent across methods.
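For illustration, the overspend guard and the normalization might look like this. It is a sketch with hypothetical names (adjust, normalizeBudget), and the fallback of 50 when all allocations are equal is an assumption.

```javascript
const TOTAL_POINTS = 100;
const STEP = 5;

// Handle a plus (+1) or minus (-1) tap. Returns a new allocations object,
// or the unchanged one if the move would go below 0 or exceed the budget.
function adjust(allocations, id, direction) {
  const current = allocations[id] ?? 0;
  const next = current + direction * STEP;
  const spentElsewhere =
    Object.values(allocations).reduce((sum, points) => sum + points, 0) - current;
  if (next < 0 || spentElsewhere + next > TOTAL_POINTS) return allocations;
  return { ...allocations, [id]: next };
}

// Min-max normalize so the highest allocation maps to 100 and the lowest to 0.
// Assumption: if every item received the same points, each gets 50.
function normalizeBudget(allocations) {
  const values = Object.values(allocations);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const out = {};
  for (const [id, points] of Object.entries(allocations)) {
    out[id] = max === min ? 50 : ((points - min) / (max - min)) * 100;
  }
  return out;
}

let allocations = { A: 0, B: 0, C: 0 };
for (let i = 0; i < 8; i++) allocations = adjust(allocations, "A", +1);
for (let i = 0; i < 2; i++) allocations = adjust(allocations, "B", +1);
for (let i = 0; i < 12; i++) allocations = adjust(allocations, "C", +1); // capped by the budget
console.log(allocations); // { A: 40, B: 10, C: 50 }
console.log(normalizeBudget(allocations)); // { A: 75, B: 0, C: 100 }
```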

Automatic Algorithm Selection

Survey creators shouldn’t need to understand the tradeoffs above. So we pick for them based on list size:

5 or fewer items: Drag & Drop. Fast, familiar, accurate enough.
6-10 items: Pairwise Comparison. Manageable question count, better data than drag-and-drop at this scale.
11+ items: MaxDiff. The only method that stays reasonable at scale.
Budget: Manual selection only. It’s a conceptually different question (“allocate resources” vs “rank by preference”).

Creators can override this if they want, but the defaults work well for most cases.
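The selection rule itself is tiny. In this sketch the string identifiers ("sort", "pairwise", "maxdiff", "budget") and the function name are made up for the example.

```javascript
// Pick a ranking method from list size, unless the creator chose one explicitly.
// Budget is never chosen automatically; it only arrives via `override`.
function pickAlgorithm(itemCount, override) {
  if (override) return override;
  if (itemCount <= 5) return "sort";
  if (itemCount <= 10) return "pairwise";
  return "maxdiff";
}

console.log(pickAlgorithm(4)); // "sort"
console.log(pickAlgorithm(8)); // "pairwise"
console.log(pickAlgorithm(25)); // "maxdiff"
console.log(pickAlgorithm(25, "budget")); // "budget"
```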

UX Decisions That Took Longer Than the Algorithms

The 1-Second Advance Delay

When a respondent picks an option in pairwise or MaxDiff, we don’t advance immediately. There’s a 1-second delay where their selection is highlighted. During that window, they can tap the same item again to undo, or tap a different item to switch.

This was not our first design. We tried instant advance (felt rushed, no chance to reconsider), a confirmation button (too many taps, slowed everything down), and a longer delay (felt sluggish). One second turns out to be the sweet spot: fast enough to maintain flow, slow enough to catch mistakes.
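The delay logic can be sketched without any framework. The controller shape, the name createDelayedAdvance, and the injectable timers object (there so the behavior can be tested without waiting) are all assumptions for illustration.

```javascript
// Advance only after `delayMs` without further taps.
// Tapping the selected item again undoes it; tapping another item switches.
function createDelayedAdvance(
  onAdvance,
  delayMs = 1000,
  timers = {
    set: (fn, ms) => setTimeout(fn, ms),
    clear: (handle) => clearTimeout(handle),
  }
) {
  let selected = null;
  let handle = null;

  return {
    tap(id) {
      // Any tap cancels the pending advance.
      if (handle !== null) timers.clear(handle);
      handle = null;
      if (selected === id) {
        selected = null; // same item tapped again: undo the selection
        return;
      }
      selected = id; // first tap, or a switch to a different item
      handle = timers.set(() => {
        const choice = selected;
        selected = null;
        handle = null;
        onAdvance(choice);
      }, delayMs);
    },
    get selected() {
      return selected;
    },
  };
}
```

A component would call tap from its click handler, highlight whatever selected returns, and move to the next pair or screen inside onAdvance.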

Progress Bars Over Step Counters

“Question 7 of 23” is technically more informative than a progress bar. But it also tells you that you’re not even a third of the way through, which is discouraging. A progress bar filling up feels like forward motion. Roughly the same information, different framing.

The Podium

When ranking is complete, we show a podium visualization (1st, 2nd, 3rd position) instead of a flat list. This gives respondents immediate, satisfying feedback: your top pick is confirmed, and you can see the runners-up.

On mobile, labels appear below the podium since the bars are too narrow for text. On desktop, labels sit inside the bars. There’s a restart button underneath if you want to go again.

Things We Got Wrong (And Fixed)

Storing Session History in the Database

Our first implementation saved the full comparison history in the database alongside the final scores. The idea was to support resuming interrupted sessions. In practice, the history was large (every pair/screen + result), rarely needed, and complicated the data model for all ranking responses.

We removed it. Now we save only the final scores and a lightweight session state (algorithm type + current step number). If a respondent returns to their answer, they see the results and can restart if they want.

The Shuffle Anti-Pattern

Early on we shuffled arrays with array.sort(() => Math.random() - 0.5). This is one of those things that looks correct but isn’t. The sort function expects a consistent comparator. A random one violates that contract, producing non-uniform distributions that vary by engine.

In V8 (Chrome/Node.js), the bias is measurable. Items at the start of the array are more likely to stay near the start after “shuffling.” For a ranking tool, this is a data quality problem.

State Management in React

Pairwise and MaxDiff run entirely client-side. The session object is mutable (tracking Elo ratings, seen pairs, etc.), but React expects immutable state. We use a ref to hold the session and state hooks for the values React needs to re-render (current pair, history, selection).

The tricky part is undo. When a respondent goes back, we can’t simply “unpick” from a mutable session. Instead, we create a fresh session and replay the history up to the previous step. This guarantees the Elo ratings and pair tracking are exactly what they were before the last vote.

It’s not the most elegant pattern, but it’s correct and testable. Each undo is a full, deterministic replay. No partial rollback state to manage.
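Stripped of React, the replay pattern looks roughly like this. It is a simplified pairwise-only sketch with hypothetical names. In a component, the session would live in a ref and the history in state.

```javascript
function createSession(items) {
  const ratings = {};
  for (const id of items) ratings[id] = 1000;
  return { ratings, seenPairs: new Set() };
}

// Mutates the session: Elo update (K=32) plus pair tracking.
function applyVote(session, { winner, loser }) {
  const { ratings } = session;
  const expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400));
  const delta = 32 * (1 - expected);
  ratings[winner] += delta;
  ratings[loser] -= delta;
  session.seenPairs.add([winner, loser].sort().join(":"));
}

// Rebuild a session from scratch by replaying recorded votes in order.
function replay(items, history) {
  const session = createSession(items);
  for (const vote of history) applyVote(session, vote);
  return session;
}

// Undo = drop the last vote and replay everything before it.
function undo(items, history) {
  const trimmed = history.slice(0, -1);
  return { session: replay(items, trimmed), history: trimmed };
}

const items = ["A", "B", "C"];
const votes = [
  { winner: "A", loser: "B" },
  { winner: "C", loser: "A" },
  { winner: "B", loser: "C" },
];
const afterUndo = undo(items, votes);
console.log(afterUndo.history.length, afterUndo.session.ratings);
```

The history stores the pairs that were actually shown, so the replay never depends on the random pair selection producing the same sequence twice.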

What We Didn’t Build

Adaptive Questioning by Elo Rating

A smarter pairwise algorithm could use current Elo ratings to pick maximally informative pairs. If A is rated 1200 and B is rated 800, matching them isn’t very useful. Matching A against C (rated 1180) would produce more distinguishing information.

We considered this. The problem is that it creates a dependency between pair selection and scoring. When combined with partial coverage, it can create feedback loops where early random results permanently shape which comparisons get made. The simpler “prioritize least-seen items” approach is more reliable.

Server-Side Session Management

We evaluated running the ranking session on the server. Advantages: resume across devices, prevent response manipulation, enable real-time analytics. Disadvantages: latency on every click (pairwise needs sub-100ms responses to feel good), server cost per active survey, online-only requirement.

Client-side was the right call for a survey tool where responsiveness matters more than tamper-resistance. The final scores are what get stored. If someone wants to game their own ranking survey, that’s their business.

Tie Handling in Results

When two items have identical scores (common with few comparisons), we don’t break the tie. They share the same position. We considered secondary tie-breakers (match count, raw Elo in pairwise, best-count in MaxDiff) but decided that false precision is worse than admitting uncertainty. If your data can’t distinguish two items, saying so is more honest than inventing an order.
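Shared positions can be assigned in a single pass over the sorted scores. This sketch assumes “1, 2, 2, 4” numbering (the position after a tie skips ahead) and exact score equality. Both are assumptions, as is the function name.

```javascript
// Assign positions so that equal scores share a position.
// scores: { [itemId]: score on the 0-100 scale }
function positions(scores) {
  const sorted = Object.entries(scores).sort(([, a], [, b]) => b - a);
  let position = 0;
  let previous = null;
  return sorted.map(([id, score], index) => {
    // Only move to a new position when the score actually changes.
    if (score !== previous) position = index + 1;
    previous = score;
    return { id, score, position };
  });
}

const ranked = positions({ A: 100, B: 50, C: 50, D: 0 });
console.log(ranked.map((r) => `${r.id}:${r.position}`)); // ["A:1", "B:2", "C:2", "D:4"]
```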

Numbers

Across all four algorithms, the ranking element produces 0-100 scores that can be compared and aggregated. The same downstream analytics work regardless of how the data was collected. A researcher can switch algorithms between survey runs and still compare results, because the normalization makes outputs commensurable.

For survey creators, the choice comes down to: how many items are you ranking, and do you want respondents to think about trade-offs (budget) or preferences (everything else)?

For respondents, the experience is always the same: a short, focused task that doesn’t overstay its welcome.

That’s the goal, anyway. If we got something wrong, the support channel is open.

How We Built Four Ranking Algorithms