Statistical Search

In addition to powering our semantic APIs, Sturdy Statistics’ probability model also powers our statistical search engine. When you submit a search query, your index model maps the query to its thematic contents. Because our models return structured Bayesian likelihoods, we are able to use a statistically meaningful scoring metric called the Hellinger distance to score each search candidate. Unlike cosine distance, whose values are not well defined and can only be used for ranking, the Hellinger distance score defines the percentage of a document that ties directly to your theme.

This well-defined score enables not only search ranking but also semantic search filtering, with the ability to define a hand-selected hard cutoff.
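For intuition, here is a minimal sketch of the Hellinger distance between two topic distributions. The topic mixtures below are made up purely for illustration; the score our API reports is computed from the index model’s own topic allocations and is not reproduced here.

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions.
    # Bounded in [0, 1]: 0 for identical distributions, 1 for distributions
    # with disjoint support, which is what makes it usable as an absolute
    # score rather than a rank-only signal like cosine distance.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Illustrative topic mixtures (made up, not API output):
query_topics  = np.array([0.70, 0.20, 0.10, 0.00])  # query mapped onto the index's topics
on_theme_doc  = np.array([0.65, 0.25, 0.05, 0.05])  # paragraph mostly about the same themes
off_theme_doc = np.array([0.02, 0.03, 0.05, 0.90])  # paragraph about something else

print(hellinger(query_topics, on_theme_doc))   # small distance: strong thematic match
print(hellinger(query_topics, off_theme_doc))  # close to 1: weak thematic match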

Focused Example: Google’s Discussions about ‘FX’

We are using two new capabilities of the Query API: filtering and search. The Query API supports arbitrary SQL conditions in its filter: we leverage DuckDB under the hood and support all of DuckDB’s SQL query syntax.
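For example, a filter can combine several metadata conditions with ordinary SQL. The snippet below is illustrative: ticker and pub_quarter are metadata fields in the earnings-call index used throughout this post, and the fields available to you depend on how your index was built.

# Any DuckDB-compatible SQL condition can be passed as the filter.
# ticker and pub_quarter are metadata fields in this earnings-call index;
# your own index may expose different fields.
FILTER_SINGLE = "ticker='GOOG'"
FILTER_COMBINED = "ticker IN ('GOOG', 'MSFT') AND pub_quarter >= '2023Q1'"

df = index.query("fx", filters=FILTER_COMBINED, limit=200)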

Search Parameters

In addition to accepting a search term, our Query API accepts a semantic_search_cutoff and a semantic_search_weight. The semantic_search_cutoff is a value between 0 and 1 corresponding to the fraction of a paragraph that focuses on the search term. Our value of .1 below means that at least 10% of a paragraph must focus on our search term. This enables flexible semantic filtering.

The semantic_search_weight dictates the weight placed on our thematic search score versus our TF-IDF weighted exact-match score. Every use case is different, and our API provides the flexibility to tune your indices accordingly while providing sensible defaults out of the box.
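As a rough mental model, the two parameters amount to a filter step followed by a blended ranking. The sketch below is hypothetical and does not reproduce the API’s internals; thematic_score and tfidf_score are stand-in names for the two scores described above.

# Hypothetical sketch of how the two parameters interact; not the API's internals.
def rank_candidates(candidates, cutoff=0.1, weight=0.3):
    # candidates: dicts holding a thematic score (fraction of the paragraph on
    # the query's theme, between 0 and 1) and a TF-IDF weighted exact-match score.
    kept = [c for c in candidates if c["thematic_score"] >= cutoff]  # semantic_search_cutoff
    for c in kept:
        # semantic_search_weight blends the two scores; weight=0 ranks on
        # exact matches only, as in the example further below.
        c["score"] = weight * c["thematic_score"] + (1 - weight) * c["tfidf_score"]
    return sorted(kept, key=lambda c: c["score"], reverse=True)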

Semantic Search Results

In the examples below, you’ll notice that our index surfaced paragraphs that matched not only on FX, but also on foreign exchange, pressures, and slowdown.

NB: Utility to Display Annotated Text Results

def display_text(df, highlight):
    import re
    # Render as Markdown inside a notebook; fall back to plain print otherwise.
    try:
        from IPython.display import display, Markdown
        show = lambda x: display(Markdown(x))
    except ImportError:
        show = print
    def fmt(r):
        # Top 5 paragraph topics as "title: prevalence%".
        t = "\n".join(f"1. {d['short_title']}: {int(d['prevalence']*100)}%" for d in r["paragraph_topics"][:5])
        # Strip characters that would interfere with Markdown/MathJax rendering.
        txt = re.sub(r"[*$]", "", r["text"])
        # Bold any word that matches a highlight term: exact match for short
        # terms, substring match for longer ones.
        h = lambda m: f"**{m.group()}**" \
            if any((len(w) < 4 and m.group().lower() == w.lower()) \
            or (len(w) >= 4 and w.lower() in m.group().lower()) for w in highlight) else m.group()
        body = re.sub(r"\b\w+\b", h, txt)
        return f"<em>\n\n#### Result {r.name+1}/{df.index.max()+1}\n\n##### {r['ticker']} {r['pub_quarter']}\n\n{t}\n\n{body}</em>"
    show("\n\n...\n\n".join(df.apply(fmt, axis=1)))
SEARCH_QUERY = "fx"
FILTER = "ticker='GOOG'"

df = index.query(SEARCH_QUERY, filters=FILTER, 
                 semantic_search_cutoff=.1, semantic_search_weight=.3, 
                 max_excerpts_per_doc=20, limit=200)

# Highlight the words from the ten topics that best match the search query.
topics = index.topicSearch(SEARCH_QUERY).iloc[:10]
words_to_highlight = topicWords.loc[topicWords.topic_id.isin(topics.topic_id)].topic_words.explode()
display_text(df.iloc[[0, -1]], highlight=words_to_highlight)  # or pass an explicit list, e.g. highlight=["fx", "foreign", "exchange", "stabilization", "pressures", "slowdown", "pullback"]

Result 1/27

GOOG 2023Q1
  1. Business Growth Strategies: 39%
  2. Revenue Growth Forecasts: 26%
  3. Consumer Behavior Insights: 21%
  4. Google Advertising Revenue Growth: 5%
  5. AI-Powered Advertising: 2%

Justin Post: Just digging into Search kind of low single-digit growth ex FX. Can you talk about the pressures there, volume versus pricing or CPCs? What’s really driving the slowdown? It’s kind of almost back to ‘09 recession levels. Just think about that. And then any signs that we’re near a bottom? Any stabilization in growth rates you can talk about or how your outlook is for ‘23 on that?

Result 27/27

GOOG 2023Q2
  1. Google Advertising Revenue Growth: 87%
  2. Macroeconomic Headwinds: 10%

Google Services revenue of 62 billion were up 1% year-on-year, including the effect of a modest foreign exchange headwind. In Google Advertising, Search and Other, revenues grew 2% year-over-year, reflecting an increase in the travel and retail verticals, offset partially by a decline in finance as well as in media and entertainment. In YouTube Ads, we saw signs of stabilization and performance, while in network, there was an incremental pullback in advertiser spend. Google Other revenues were up 9% year-over-year led by strong growth in YouTube subscriptions revenues.

Exact Match Misses 75% of the Results

The exact match search returns only 7 results. We miss 20 of the 27 matching exchanges because of the restrictiveness of exact matching rules.

## Setting semantic search weight to 0 forces an exact-match-only search.
df = index.query(SEARCH_QUERY, filters=FILTER,
                 semantic_search_cutoff=.1, semantic_search_weight=0,
                 max_excerpts_per_doc=40, limit=200)
display_text(df.iloc[[0, -1]], highlight=["fx"])

Result 1/7

GOOG 2023Q1
  1. Google Advertising Revenue Growth: 53%
  2. Advertising Revenue Trends: 40%
  3. Cloud Performance Metrics: 2%

I’ll highlight 2 other factors that affected our Ads business in Q4. Ruth will provide more detail. In Search and Other, revenues grew moderately year-over-year, excluding the impact of FX, reflecting an increase in retail and travel, offset partially by a decline in finance. At the same time, we saw further pullback in spend by some advertisers in Search in Q4 versus Q3. In YouTube and Network, the year-over-year revenue declines were due to a broadening of pullbacks in advertiser spend in the fourth quarter.

Result 7/7

GOOG 2025Q1
  1. Google Advertising Revenue Growth: 58%
  2. Business Growth Strategies: 33%
  3. Alphabet Earnings Calls: 3%

Anat Ashkenazi: And on the question regarding my comment on lapping the strength in financial services, this is primarily related to the structural changes with regards to insurance, it is more specifically within financial services, it was the insurance segment and we saw that continue, but it was a one-time kind of a step up and then we saw it throughout the year. I am not going to give any specific numbers as to what we expect to see in 2025, but I am pleased with the fact that we are seeing and continue to see strength across really all verticals including retail and exiting the year in a position of strength. If anything, I would highlight as you think about the year, the comments I have made about the impact of FX, as well as the fact that we have one less day of revenue in Q1.

Jumping Back to the High Level

The data from our FX search query is very useful, but it’s a lot to read and digest. Let’s summarize it into a high-level overview. Because our Topic Search API supports the exact same parameters as our Query API, we can instantly switch between high-level insights and granular data.

import plotly.express as px

df = index.topicSearch(SEARCH_QUERY, filters=FILTER, semantic_search_cutoff=.1)
df["search_query"] = SEARCH_QUERY
# Sunburst of the matching topics, sized by prevalence.
fig = px.sunburst(df, path=["search_query", "short_title"], values="prevalence", hover_data=["topic_id"])
procFig(fig, height=500).show()  # procFig: figure-styling helper used throughout this series

What’s Next?

Part V: Custom Index Creation