Comparing Embedding Models for a Knowledge Graph (Part A: The 56-Entry Test)

What We’re Trying to Do

The project has 16,282 “Words of Brandon” (WoB) Q&A entries from fan events. Each entry was hand-tagged on Arcanum (a community wiki) with entity names like “kaladin”, “allomancy”, “roshar”. But tagging is inconsistent — many entries discuss an entity without being tagged with it, and some entity names are common English words (“rock”, “honor”, “ruin”) that create false matches when you try to find them with text search.

We want to use embedding models to solve both problems: discover missing connections AND avoid false matches. The question is: which embedding model does this best?

This is Part A of two posts. Here we design three hand-crafted tests and pick a winner. In Part B, we’ll question whether 56 labels were enough — and build a cross-validation framework on all 7,813 tagged entries that tells a different story.

How the Test Is Set Up

Step 1: Embed every entry (04_embed_entries.py)

Each of the 16,282 entries gets sent to an embedding API. The model reads the text and returns a vector — a list of numbers (e.g., 3072 numbers for Azure OpenAI, 1024 for Voyage). Entries that discuss similar topics will have vectors that point in similar directions.

We tested 4 models:

Model	Dimensions	Notes
Azure OpenAI `text-embedding-3-large`	3072	Microsoft-hosted OpenAI model, largest general-purpose
Azure Cohere `embed-v4`	1024	Cohere’s latest, strong on multilingual
Google Gemini `embedding-001`	3072	Google’s embedding model
Voyage `voyage-4`	1024	Specialist retrieval model, often tops benchmarks

All 16K entries were embedded with each model. The results are cached as .npy files (numpy arrays) so this only needs to happen once.

Step 2: Build entity reference vectors

For each of the 622 entity tags (excluding meta/book types), the script collects all entries that Arcanum curators explicitly tagged with that entity. It requires at least 5 such entries (to avoid noisy references). It then averages those entry embeddings into a single reference vector per entity.

This gives us 367 entity reference vectors — the “semantic fingerprint” of what discussions about each entity look like. The averaging is done after L2-normalization, so each reference vector represents the centroid of its tagged entries in the embedding space.

Why average? Think of it geometrically. Each entry embedding is a point on a high-dimensional sphere (after L2 normalization). All the entries tagged “kaladin” form a cluster on that sphere. The average of those points gives you the center of that cluster — the direction in embedding space that most represents “Kaladin-ness.” When we later ask “is this new entry about Kaladin?”, we measure how close it is to that center.

The 5-entry minimum matters because with fewer points, your “center” could be dominated by one unusual entry. With 5+, the average is more stable.

Step 3: Compute the similarity matrix

For each model, we compute a giant matrix of shape (16,282 entries x 367 entities). Each cell is the cosine similarity between an entry’s embedding and an entity’s reference vector. Cosine similarity ranges from -1 to 1, where 1 means the vectors point in exactly the same direction (maximally similar) and 0 means they’re orthogonal (unrelated).

Cosine similarity explained: vectors on a unit circle showing high similarity (small angle) vs low similarity (large angle)

In practice, most values fall between 0.3 and 0.95. The question is: above what threshold should we say “this entry is about this entity”?

Test 1: Disambiguation Accuracy

What it measures

Can the model tell the difference between the Cosmere entity and the common English word?

The ground truth

We hand-labeled 56 WoB entries across 9 deliberately ambiguous entities:

Entity	Cosmere meaning	Common English meaning	Examples
rock	Horneater character from Bridge Four	geological material	”How would you pronounce Rock’s full name?” vs. “beautiful red rock formations”
honor	A Shard (divine entity) of Roshar	the English concept of honor	”Could Honor and Autonomy be considered opposites” vs. “it has been an honor”
lift	Young Edgedancer character	the verb “to lift"	"Is Lift one-of-a-kind?” vs. “strong enough to lift”
ruin	A Shard of Scadrial	to destroy/spoil	”Can Odium influence people the same way Ruin can?” vs. “I don’t wanna ruin The Emperor’s Soul”
stick	A famously stubborn stick from Shallan’s Soulcasting scene	a common object	”What is Stick’s telos?” vs. “If you drew a stick figure”
marsh	Kelsier’s brother, an Inquisitor	wetland (though this usage didn’t appear)	“Where’s Marsh?” / “Is Marsh happy?“
focus	A Cosmere-specific magic system concept	the English verb “to focus"	"What is the definition of a focus (in the Ars Arcanum)?” vs. “I generally focus on one character”
color	A magic system mechanic (BioChromatic Breath)	physical color	”when the color’s pulled out of something” vs. “Skin color is a terrible reason to judge someone”
spook	Mistborn character	the verb “to spook/frighten"	"Was Spook still alive when…” vs. “something he does spooked her”

Each label says: this specific entry either is or is not about the Cosmere entity.

How the test works

For each entry in the ground truth, the script looks up its cosine similarity to the entity’s reference vector. If the similarity is above a threshold, it predicts “yes, this entry is about the entity.” Then it compares that prediction to the hand label.

The metrics: Precision, Recall, and F1

These are the three standard metrics for binary classification:

Precision = “Of all the entries the model said were about the entity, how many actually were?”

Precision = True Positives / (True Positives + False Positives)

A precision of 1.00 means: every time the model said “this is about Rock the character,” it was right. No false alarms. But it might have missed some genuine entries.

Recall = “Of all the entries that actually are about the entity, how many did the model find?”

Recall = True Positives / (True Positives + False Negatives)

A recall of 1.00 means: the model found every single entry about Rock. But it might have also incorrectly flagged some non-Rock entries.

F1 = The harmonic mean of precision and recall. It balances both:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

F1 = 1.00 is perfect. F1 = 0 means total failure. The harmonic mean (rather than arithmetic mean) penalizes models that sacrifice one metric for the other — you can’t get a high F1 by having 100% recall but 10% precision.

Why the threshold matters so much. A low threshold (0.4) means “accept anything even vaguely similar” — high recall (you catch everything) but low precision (lots of false positives). A high threshold (0.8) means “only accept very confident matches” — high precision (few mistakes) but low recall (you miss a lot). The ideal threshold sits in the sweet spot where F1 peaks.

How threshold affects precision and recall: low threshold catches everything but adds noise, high threshold is precise but misses entries

Different models peak at different thresholds because their similarity distributions are different. Azure OpenAI might produce scores of 0.6-0.8 for relevant entries, while Gemini might produce 0.85-0.97 for the same entries. The threshold is not comparable across models — it’s model-specific.

What we learned

Threshold sweep results (overall F1):

Threshold	Azure OpenAI	Azure Cohere	Gemini	Voyage
0.40	0.77	0.78	0.72	0.81
0.50	0.86	0.90	0.72	0.85
0.55	0.90	0.93	0.72	0.83
0.60	0.93	0.88	0.72	0.70
0.65	0.78	0.76	0.72	0.56
0.70	0.71	0.65	0.78	0.58
0.75	0.31	0.36	0.84	0.41
0.80	0.07	0.07	0.95	0.20
0.85	0.07	—	0.74	0.07

F1 vs. threshold curves for all 4 models, showing each model's peak and Gemini's anomalous flat region

Azure OpenAI: Peaks at F1=0.93 (threshold 0.6). Has a wide useful range — 0.5 through 0.65 are all above 0.78. This means you have room to tune the threshold without destroying performance.

Azure Cohere: Peaks at F1=0.93 (threshold 0.55). Very similar to Azure OpenAI but peaks slightly earlier and drops slightly faster.

Gemini: Flat at 0.72 for thresholds 0.4-0.7, then rises to 0.95 at 0.8. This is bizarre and reveals a fundamental problem. At threshold 0.7, Gemini has recall=1.00 (finds everything) but precision=0.64 (36% of its matches are wrong). The similarity scores are so compressed that even unrelated entries score above 0.7.

Voyage: Peaks at F1=0.85 (threshold 0.5) but drops steeply. By 0.6 it’s already at 0.70. At 0.7, it gets -- (zero recall) for honor, lift, ruin, and marsh — meaning those entities’ true-positive entries all scored below 0.7. Its useful range is narrow: essentially just 0.4-0.5.

What “focus” reveals about the test. Every model scored -- for “focus” — no entity reference embedding was generated because “focus” as a Cosmere concept had fewer than 5 explicitly tagged entries. This is an honest limitation of the test: entities with sparse tagging on Arcanum simply can’t be evaluated. The ground truth has 4 positive and 4 negative examples for “focus,” but without a reference embedding, no model can even attempt it.

What Gemini’s flat F1 reveals. The fact that Gemini’s F1 barely changes from 0.4 to 0.7 means its similarity distribution is extremely compressed. Imagine a model where every entry scores between 0.90 and 0.98 against every entity. Lowering the threshold from 0.7 to 0.4 changes nothing because nothing scored that low to begin with. This makes the model effectively useless for threshold-based systems — you can’t draw a meaningful line between “related” and “unrelated.”

Score distributions for Azure OpenAI (wide gap between relevant and irrelevant) vs Gemini (heavily overlapping)

Test 2: Similarity Quality

What it measures

For 10 well-known entities, what are the 5 most similar entities? This is a qualitative “smell test” — do the results make Cosmere sense?

How it works

The script takes each entity’s reference vector and computes cosine similarity against all other entity reference vectors. It ranks them and shows the top 5.

What we learned

Let me walk through a few telling examples:

Hoid (the cosmere’s most connected character — a worldhopper who appears across multiple series):

Model	Top 5
Azure OpenAI	worldhoppers (0.90), frost (0.85), vasher (0.85), yolish lightweaving (0.85), worldhopping (0.84)
Cohere	worldhoppers (0.87), dalinar (0.86), frost (0.86), kaladin (0.85), kelsier (0.85)
Gemini	worldhoppers (0.95), frost (0.94), cosmere (0.94), worldhopping (0.94), scadrial (0.93)
Voyage	sigzil (0.82), worldhoppers (0.82), frost (0.81), yolish lightweaving (0.81), vasher (0.79)

Azure OpenAI and Voyage both correctly identify the cosmere-lore connections (Frost, Vasher, Yolish Lightweaving). Voyage uniquely puts Sigzil first — Hoid’s apprentice — which is an impressively specific connection. Cohere pollutes its results with generic main characters (Dalinar, Kaladin, Kelsier) that aren’t especially Hoid-related.

Shallan (Stormlight character, lightweaver, scholar):

Model	Top 5
Azure OpenAI	jasnah (0.88), dalinar (0.87), kaladin (0.87), knights radiant (0.87), adolin (0.85)
Voyage	jasnah (0.84), shashara (0.82), dalinar (0.82), eshonai (0.82), sylphrena (0.82)

Voyage again surfaces interesting cross-series connections. Shashara (a scholar/lightweaver-analog from Warbreaker) is a deep thematic parallel to Shallan. Eshonai and Sylphrena are also more specific than the generic “dalinar, kaladin” that other models produce.

The dynamic range problem: Look at Gemini’s scores. For Hoid, the top 5 range from 0.93 to 0.95. For Azure OpenAI, they range from 0.84 to 0.90. This matters because in the frontend, users adjust sliders to explore relationships. If everything is 0.93-0.95, the sliders have almost no effect. Azure OpenAI’s wider spread (0.07 gap between #1 and #5 vs Gemini’s 0.02 gap) gives the UI meaningful resolution.

Test 3: Discovery Power

What it measures

If we use this model to implicitly tag all 16K entries (anything above the threshold gets tagged), how many new tags and edges do we discover?

How it works

For every entry, the script checks its similarity to all 367 entities. If the similarity exceeds the threshold AND the entry wasn’t already explicitly tagged with that entity, it counts as a “new implicit tag.” Then it rebuilds the co-occurrence graph (two entities share an edge if they co-occur in the same entry) and measures the resulting connectivity.

What we learned (threshold = 0.7)

Metric	Azure OpenAI	Azure Cohere	Gemini	Voyage
New implicit tags	15,100	4,463	4,216,588	15,161
Entries newly tagged	1,688	1,242	6,875	1,848
Total edges	9,422	1,967	103,623	7,122
Isolated nodes	218	242	109	226

Discovery power comparison: Gemini's 4.2M implicit tags dwarf the other models (note log scale)

Gemini generated 4.2 million implicit tags. That’s roughly 16K entries x 260 entities each — it tagged nearly every entry with nearly every entity. Its top entities each got ~14,500 additions out of 16,282 total entries. That’s not discovery, that’s a model saying “everything is related to everything.” The 103K edges it produced would turn the graph into an unreadable hairball.

Azure OpenAI and Voyage produced very similar numbers of implicit tags (~15K). But Azure OpenAI created more edges (9,422 vs 7,122) and fewer isolated nodes (218 vs 226), meaning it produces a more connected graph.

Azure Cohere was the most conservative — only 4,463 new tags and 1,967 edges. It barely extends the graph.

The “top entities” lists reveal model personality:

Azure OpenAI’s top implicit tags are broad cosmere concepts: “cosmere sequence” (+435), “realmatic theory” (+364), “investiture” (+303). These are reasonable — many WoB entries discuss cosmere mechanics without being explicitly tagged.

Voyage’s list is similar but heavier on meta-concepts: “cosmere sequence” (+970), “conflux novels” (+564). The “conflux novels” result is concerning — that’s a meta tag, not a true semantic connection.

Gemini’s list is nonsensical: “taldain” (+14,824), “nalthis” (+14,796). It tagged nearly every entry with nearly every world name.

Discovery Power reveals the “Goldilocks zone” problem. Too few implicit tags (Cohere: 4K) means you’re barely improving over explicit tags alone. Too many (Gemini: 4.2M) means you’ve drowned the signal in noise. The right model finds a middle ground where the new connections are genuinely useful — not so aggressive that “Paalm” gets tagged on 14,466 entries (Gemini), but aggressive enough to rescue isolated nodes.

The “entries newly tagged” metric is particularly telling. These are entries that had ZERO explicit tags and got assigned to an entity purely via embedding similarity. Azure OpenAI rescued 1,688 and Voyage rescued 1,848 of these orphans. But Gemini “rescued” 6,875 — by indiscriminately tagging everything.

Summary: Why Azure OpenAI Won

Criterion	Azure OpenAI	Cohere	Gemini	Voyage
Peak F1	0.93	0.93	0.95*	0.85
Usable threshold range	0.50 - 0.65	0.50 - 0.60	0.75 - 0.85	0.40 - 0.50
Score dynamic range	Wide	Medium	Very narrow	Medium
Discovery (new tags)	15,100	4,463	4,216,588	15,161
Graph edges	9,422	1,967	103,623	7,122
Similarity quality	Good	Generic	Good but compressed	Best (cross-series)

*Gemini’s F1=0.95 is at threshold 0.8, where it still produces 4M+ false tags.

Azure OpenAI is the best all-around choice for this project because:

It has the highest F1 at a practical threshold (0.93 at 0.60)
Its wide usable threshold range (0.50-0.65) gives the frontend sliders room to work
Its wider score distribution means slider adjustments produce visible changes in the graph
It generates the most graph edges (9,422) without generating noise
Its similarity rankings are sensible (not as surprising as Voyage’s, but reliable)

Voyage is genuinely impressive for qualitative similarity (Sigzil for Hoid, Shashara for Shallan) but its narrow usable threshold range makes it a poor fit for a system where users control the threshold via sliders.

Cohere is too conservative — it barely extends the graph beyond explicit tags.

Gemini has a fundamentally compressed similarity space that makes it unsuitable for threshold-based classification at any practical threshold.

But is this the full story? We evaluated on 56 hand-labeled entries across 9 entities — 0.3% of our data, and specifically the hardest 0.3%. In Part B, we build a proper cross-validation framework on all 7,813 tagged entries and discover that the rankings change completely.