Boosting scores based on Array Attribute Matching

I am working with a schema where a field mcat_tree is defined as an array (e.g. "mcat_tree": ["189194", "192170", "PID 19217", "189194R", "189194P"]). It is defined as an attribute field for memory/scale reasons. My goal is simple: in the ranking phase, I want to check whether a specific string ("191984"), or any one of the strings from an input array (["12345", "191984"]) passed as a query parameter, exists in that array. If it does, I want to apply a significant boost (e.g., +100 to the score).

The Problem: While searching/filtering in YQL using contains works perfectly, I’ve struggled to find a clean, performant way to do this inside a rank-profile using only the array attribute. Everything I’ve tried falls short for one reason or another; the only solution that has worked reliably is duplicating the data into a mapped tensor: field mcat_tree_tensor type tensor(mcat_tree{})

And then ranking using: sum(query(my_param) * attribute(mcat_tree_tensor))
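For reference, the full setup looks roughly like this (simplified; the indexing settings and the query-input declaration here are approximate):

field mcat_tree_tensor type tensor(mcat_tree{}) {
    indexing: attribute
}

rank-profile tensor_boost {
    inputs {
        query(my_param) tensor(mcat_tree{})
    }
    first-phase {
        expression: nativeRank + sum(query(my_param) * attribute(mcat_tree_tensor))
    }
}

Sending input.query(my_param)={{mcat_tree:"191984"}:100.0} then yields the +100 boost whenever "191984" is present in the document’s tensor.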

Questions for the Community:

Is this the intended pattern? Is creating a mirrored tensor the recommended “Vespa way” for ranking against array elements, or is there a way to use the array attribute directly in an expression that I’m missing?

Memory Bloat: Tensors are powerful but memory-heavy. If I have around 20 million documents with 10-20 strings per array, and 3 array fields per document, is the memory overhead of a mapped tensor the “price of entry” for this logic?

Future Roadmap: Is there a plan to allow simpler element based checks in ranking expressions for array attributes (e.g., an in or contains operator) to avoid the tensor conversion?

I’d love to hear how others handle “parameter-based boosting against multivalued array type attributes” without ballooning their RAM usage.


hmm…?


What’s going on (why the tensor workaround “works”)

Vespa’s ranking expressions are built around numeric features (scalars, tensors) and match-time features. For a field like array<string> stored as an attribute, there is no ranking-expression equivalent of “string X is in this array” (i.e., no contains()/in operator you can call inside the rank-profile on the attribute values).

So the two general ways people implement “membership affects score” are:

  1. Do the membership test in the query tree, then read a numeric match feature in ranking (e.g., matches(field) or attributeMatch(field).matches).
  2. Convert/represent the set as something you can do math on (mapped tensor, weightedset + dot product, etc.).

The rank() query operator is explicitly meant for (1): only its first argument determines matching; all arguments contribute ranking features—so you can add “boost conditions” without changing recall. (VespaDocument)


Recommended solution for your exact requirement (“+100 if any overlap”)

Option A: Keep array<string> attribute; add a rank() side-clause; boost via matches() / attributeMatch()

Why it’s the “Vespa way” for boolean boosts:
You keep the efficient attribute representation, avoid per-document tensor materialization, and only compute a small numeric feature during ranking.

Single needle

YQL

select * from sources
where rank(userQuery(), mcat_tree contains @needle)

Rank-profile

rank-profile myrank inherits default {
  first-phase {
    expression: nativeRank + 100 * matches(mcat_tree)
  }
}
  • matches(name) returns 1 if the field is matched by the query, else 0. (VespaDocument)
  • rank() is designed for “boost if condition holds” without changing matching logic. (VespaDocument)

Query list (e.g., ["12345","191984"])

Use the in operator or dotProduct/weightedSet depending on whether you want only boolean or scoring by overlap.

Boolean overlap (simple): in

select * from sources
where rank(userQuery(), mcat_tree in (@ids))
  • in matches if any array element is in the provided set. (VespaDocument)
  • in is explicitly Ranking: None and “does not provide match information for text ranking features,” so don’t expect a raw score from it—but it still works well as a membership check to drive a boolean boost via matches(mcat_tree). (VespaDocument)
  • Pass values via parameter substitution (fast parsing): ids=12345,191984 (VespaDocument)
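Put together, the full set of request parameters might look like this (a sketch; the query value is just a placeholder for the user’s free-text input):

yql=select * from sources where rank(userQuery(), mcat_tree in (@ids))
ids=12345,191984
query=running shoes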

Alternative boolean boost using a count feature

expression: nativeRank + 100 * (attributeMatch(mcat_tree).matches > 0)

attributeMatch(name).matches is “the number of query terms matched in this attribute.” (VespaDocument)

Performance notes for Option A

  • If you run this condition broadly, ensure the field is configured for efficient attribute searching (commonly attribute: fast-search)—this is also called out as important for operators like in on attributes. (VespaDocument)
  • For string IDs, avoid match:text; Vespa warns it can cause recall issues with in because the field can be tokenized while in does not tokenize the input set. (VespaDocument)
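A field definition reflecting both notes could look like this (a sketch; summary inclusion is optional):

field mcat_tree type array<string> {
    indexing: attribute | summary
    attribute: fast-search
    match: word
}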

When you should not use a tensor mirror

Your mirror-tensor approach is valid, but it’s usually not the first choice for a boolean “any hit → +100” boost because:

  • tensorFromLabels(...) creates a temporary tensor with “build cost and extra memory touched.” (VespaDocument)
  • Converting arrays/weightedsets to tensors at query runtime has a cost linear in the number of elements. (VespaDocument)
  • Persisting sparse/mapped tensors as attributes can be very memory-expensive at scale; there are real reports of sparse tensor attributes using “a huge amount of memory.” (GitHub)

So: for your goal (boolean boost), the tensor mirror is typically paying for power you don’t need.


If you want “more than boolean” (e.g., boost proportional to overlap, weights, etc.)

Option B: Model as weightedset<string> and use dotProduct / wand + rawScore

This is the path Vespa’s docs focus on for “query tokens vs document tags” ranking.

  • dotProduct computes a weighted overlap and stores it in raw scores you can use in ranking. (VespaDocument)
  • wand is an optimized version that can skip work when you have larger lists and want top-k behavior. (VespaDocument)
  • Docs explicitly suggest weightedset<string> fields (with attribute: fast-search) for these operators. (VespaDocument)

Why it can be better than array<string>

  • It’s purpose-built for “tags/IDs with ranking impact.” (VespaDocument)
  • You can do: “boost = 100 if rawScore > 0” or “boost = min(100, rawScore)” etc.
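A sketch of the pieces together (the field name mcat_tree_ws and the weights are illustrative):

Schema

field mcat_tree_ws type weightedset<string> {
    indexing: attribute | summary
    attribute: fast-search
}

YQL

select * from sources
where rank(userQuery(), dotProduct(mcat_tree_ws, {"12345": 1, "191984": 1}))

Rank-profile

rank-profile overlap inherits default {
    first-phase {
        expression: nativeRank + if(rawScore(mcat_tree_ws) > 0, 100, 0)
    }
}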

Passing the query-side set/weights
Vespa Query API supports sending weighted structures via query inputs; examples show map/tensor-like structures in input.query(...). (VespaDocument)
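For example, with a mapped-tensor input declared inside the rank-profile:

inputs {
    query(my_param) tensor(mcat_tree{})
}

the query-side set can be passed on the request as (a sketch):

input.query(my_param)={{mcat_tree:"12345"}:1.0,{mcat_tree:"191984"}:1.0}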


Memory bloat: is it the “price of entry”?

For your requirement: No

For boolean membership boosting, query-tree membership + match features avoids tensor storage entirely. (VespaDocument)

If you insist on mapped tensors per document: expect large RAM

At your scale (20M docs × ~15 values × 3 fields ≈ 900M labels/cells), any per-cell overhead (label refs, postings, attribute structures) becomes enormous. Sparse tensor attributes have known memory pain points in practice. (GitHub)

If you ever need tensor math, prefer:

  • Temporary conversion + rerank only (second-phase / small rerank-count), since the conversion is linear in element count (see the sketch after this list). (VespaDocument)
  • Or store tensors only when you truly need fast tensor operations on many hits.
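For that first option, a sketch (profile name and rerank-count are illustrative; query(my_param) is the mapped-tensor input from earlier):

rank-profile rerank_overlap inherits default {
    inputs {
        query(my_param) tensor(mcat_tree{})
    }
    first-phase {
        expression: nativeRank
    }
    second-phase {
        rerank-count: 200
        expression: firstPhase + 100 * (sum(query(my_param) * tensorFromLabels(attribute(mcat_tree), mcat_tree)) > 0)
    }
}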

Roadmap: “contains/in in ranking expressions for array attributes”?

Two relevant signals from Vespa’s public materials:

  • Vespa invests in exposing more match/rank features from structured matching (e.g., map/array-of-struct discussions), which is the same direction as “use query matching; use match features in ranking.” (GitHub)
  • The official, documented supported solutions for multivalue matching + ranking are still query operators (rank, in, dotProduct, wand) and match features / raw scores, not string-membership functions inside the rank expression language. (VespaDocument)

So, the practical expectation today is: “membership in ranking” is done via query operators + match features, not direct array-string operations inside rank profiles.


Practical checklist (to make Option A robust)

  1. Field settings for ID-like tokens
  • Use match: word or match: exact (avoid match: text for IDs). (VespaDocument)
  2. Prefer rank() for boost-only conditions
  • rank(userQuery(), <membership-clause>) is exactly what it’s for. (VespaDocument)
  3. Use numeric match features
  • matches(mcat_tree) for boolean. (VespaDocument)
  • attributeMatch(mcat_tree).matches if you want count-based logic. (VespaDocument)
  4. If the list is large
  • in is best for filtering (no ranking). (VespaDocument)
  • For overlap scoring, move to dotProduct/wand + rawScore, typically with weightedset<string> fields. (VespaDocument)

Would these work for imported fields? Four of the five array fields whose values I need to check for boosting are imported from another schema.


Seems so, with some constraints:


Yes — the same patterns work with imported fields, with a few important constraints

What “imported fields” really are

In Vespa parent/child, an imported field is effectively an attribute value read through a reference (“real-time join”), and it is usable as if it were stored with the child. (docs.vespa.ai)

That means:

  • You can use attribute(imported_field) in ranking (same syntax as local attributes). (docs.vespa.ai)
  • You can use imported fields in query operators (filtering/boosting), but remember: only attribute fields can be imported, and attribute fields do not support full text matching. (GitHub)

1) Your tensor-mirror approach works for imported fields

If you store mcat_tree_tensor in the parent document and import it into the child, then:

sum(query(my_param) * attribute(imported_mcat_tree_tensor))

works the same way as with local fields. The Vespa tutorial explicitly shows importing a sparse tensor and referencing it in ranking via attribute(imported_name) and tensor expressions. (docs.vespa.ai)
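In schema terms, the wiring is roughly (a sketch; schema and field names are illustrative):

schema campaign {
    document campaign {
        field mcat_tree_tensor type tensor(mcat_tree{}) {
            indexing: attribute
        }
    }
}

schema product {
    document product {
        field campaign_ref type reference<campaign> {
            indexing: attribute
        }
    }
    import field campaign_ref.mcat_tree_tensor as imported_mcat_tree_tensor {}
}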

Caveat: parent docs must be global="true", and that limits how many parent docs you can have, since every node indexes all parent docs. (docs.vespa.ai)
So this is ideal when many children share a relatively small set of parents (the intended parent/child use case), not when every child effectively has its own unique parent.


2) The “no tensor mirror” approach also works: rank() + matches() on imported array attributes

Why you can’t “just do contains in ranking”

Ranking expressions can directly index numeric array attributes (attribute(name, n)), but not string array attributes. (docs.vespa.ai)
So for array<string>, you need match-time/query-time operators to create match signals.

Pattern

Use the YQL rank() operator so the membership check does not affect recall, but does compute match features. Only the first argument determines matching; all arguments contribute rank features. (docs.vespa.ai)

Then in ranking, use:

  • matches(field): 1 if that field was matched by the query (docs.vespa.ai)
  • or matchCount(field) if you want counts (docs.vespa.ai)

Example (single ID)

where rank(
  userQuery(),
  imported_mcat_tree contains "191984"
)

Rank profile snippet:

first-phase {
  expression: nativeRank + 100 * matches(imported_mcat_tree)
}

matches(imported_mcat_tree) returns 1 if the imported field matched. (docs.vespa.ai)
And because it’s in rank(...) as the 2nd argument, it won’t change which docs are retrieved. (docs.vespa.ai)

Example (set of IDs)

If you can use in (available since Vespa 8.293.15) as your shorthand for OR, you can do:

where rank(
  userQuery(),
  imported_mcat_tree in ("12345","191984")
)

(docs.vespa.ai)

Then the same matches(imported_mcat_tree) boost applies.

Practical note: in is documented for “an integer or string field” (docs.vespa.ai). If your array<string> has match settings or tokenization expectations that make in awkward for your exact IDs, the fallback is contains OR-clauses (sketched below) or your tensor approach.
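That fallback looks like this (a sketch):

where rank(
  userQuery(),
  imported_mcat_tree contains "12345" or imported_mcat_tree contains "191984"
)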


3) Performance notes specific to imported fields (this is the main difference)

Searching imported attribute fields (with fast-search) has extra indirection. Vespa’s tuning guide recommends reducing this cost by:

  • setting the imported field to rank: filter
  • configuring visibility-delay > 0 to enable TTL caching for the parent/child indirection (docs.vespa.ai)

Where visibility-delay is configured

In services.xml, nested under <content> … <search>, as <visibility-delay>. The reference explicitly states it controls TTL caching for parent-child imported fields. (docs.vespa.ai)
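A minimal sketch (cluster id and delay value are illustrative):

<content id="mycluster" version="1.0">
    <search>
        <!-- enables TTL caching for the parent/child imported-field indirection -->
        <visibility-delay>1.0</visibility-delay>
    </search>
    <!-- documents, nodes, ... -->
</content>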

Bound the cost with phased ranking when needed

If you’re doing multiple membership checks (4 imported arrays + 1 local), consider pushing the expensive logic into second-phase and using rerank-count to bound work. Vespa’s phased ranking doc explains first-phase runs for all hits retrieved, while second-phase is bounded by rerank-count. (docs.vespa.ai)
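For example (profile and second field name are illustrative, as are the rerank-count and weights):

rank-profile multi_boost inherits default {
    first-phase {
        expression: nativeRank
    }
    second-phase {
        rerank-count: 400
        expression: firstPhase + 100 * matches(imported_mcat_tree) + 100 * matches(imported_other_field)
    }
}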


4) Constraints/pitfalls to keep in mind

Imported-field limitations

Only attribute fields can be imported; no tokenization/stemming; you can use word match and fast-search, but not text match (index-field style). (GitHub)
For your ID membership checks, that’s usually fine (it’s “database-style” exact-ish matching).

Parent document scaling

Parent docs are global and replicated; this limits how many parents you can practically have. (docs.vespa.ai)
If your “other schema” has very high cardinality (close to your 20M docs), parent/child may become the dominant memory problem regardless of whether you use tensors.


Recommended reading / similar material

  • Parent/child + imported tensors used in ranking (official tutorial) (docs.vespa.ai)
  • YQL rank() + in operator semantics (docs.vespa.ai)
  • Rank features you’ll use (matches, matchCount, attributeMatch) (docs.vespa.ai)
  • Feature tuning notes for imported-field search (rank:filter, visibility-delay) (docs.vespa.ai)
  • visibility-delay reference (TTL cache for imported fields) (docs.vespa.ai)
  • GitHub issue documenting imported-field match limitations & fast-search notes (GitHub)