Quickstart

Sturdy Statistics transforms unstructured text into structured, interpretable data using models that are transparent, verifiable, and robust. You don’t need to write prompts, tune embeddings, or trust a black box. Every insight can be inspected, audited, and traced back to specific passages in your data.

Sturdy Statistics’ automatic structure enables a range of downstream analyses, all with confidence in how the outputs were generated and with the ability to easily verify every data point.

In the following walkthrough, we introduce Sturdy Statistics’ ability to reveal structured insights from unstructured data, not with RAG or LLM black boxes but with rigorous statistical analysis that leverages traditional tabular data structures. We will analyze the past two years of earnings calls from Google, Microsoft, Amazon, NVIDIA, and Meta.

Resources

The indices used in this walkthrough are publicly available with no sign-up or API key required. You can explore these indices through our gallery or query them programmatically through a globally rate-limited pool for convenient public access.

To follow along with this walkthrough, simply run:

  • pip install sturdy-stats-sdk plotly

For a deeper dive, explore:

To explore your own data (not needed to follow along):

The Index Object

The core building block in the Sturdy Statistics NLP toolkit is the Index. Each Index is a set of documents and metadata that has been structured, or “indexed,” by our hierarchical Bayesian probability mixture model. Below, we connect to an Index that has already been trained by our earnings transcripts integration.

# The Index class ships with the sturdy-stats-sdk package installed above
# (the exact import path is an assumption; check the SDK docs).
from sturdystats import Index

index = Index(id="index_c6394fde5e0a46d1a40fb6ddd549072e")
Found an existing index with id="index_c6394fde5e0a46d1a40fb6ddd549072e".

Topic Search

The first API we will explore is the Topic Search API. This API provides a direct interface to the high-level themes that our index extracts. You can call it with no arguments to get a list of topics ordered by how often they occur in the dataset (prevalence). The resulting data is a structured rollup of all the data in the corpus. It aggregates the topic annotations across each word, paragraph, and document and generates high-level semantic statistics.


Mentions refers to the number of paragraphs in which the topic occurs. Prevalence refers to the total percentage of all data that a topic comprises.

topic_df = index.topicSearch()
topic_df.head()[["topic_id", "short_title", "topic_group_short_title", "mentions", "prevalence"]]
   topic_id  short_title                        topic_group_short_title     mentions  prevalence
0       159  Accelerated Computing Systems      Technological Developments     359.0    0.042775
1       139  Consumer Behavior Insights         Growth Strategies              585.0    0.033129
2       108  Cloud Performance Metrics          Investment and Financials      157.0    0.026985
3       115  Zuckerberg on Business Strategies  Corporate Strategy             420.0    0.026971
4       127  Comprehensive Security Solutions   Investment and Financials      146.0    0.023265
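
Because topicSearch returns a plain dataframe, you can roll these numbers up yourself with pandas. The short sketch below uses only the columns shown above: it checks how much of the corpus the ten most prevalent topics cover and aggregates prevalence by topic group.

top10_share = topic_df.head(10)["prevalence"].sum()  # topics arrive sorted by prevalence
print(f"Top 10 topics cover {top10_share:.1%} of the corpus")

# Roll the same numbers up to the topic-group level.
group_share = (
    topic_df.groupby("topic_group_short_title")["prevalence"]
    .sum()
    .sort_values(ascending=False)
)
print(group_share.head())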

Semantic Roll-up

We can quickly visualize this topic dataframe in a pie chart using plotly. The size of each slice of the pie chart represents how prominent a topic is.

This visual is fast and useful, but it quickly becomes overwhelming if we attempt to display too many topics.
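
The plotting snippets below call a procFig helper to style the figure; it is defined earlier in the full walkthrough. If you are following along from scratch, a minimal stand-in (assuming it only applies cosmetic layout settings) looks like this:

def procFig(fig, height=500):
    # Stand-in for the walkthrough's styling helper: set the figure height,
    # trim the default margins, and return the figure for chaining.
    fig.update_layout(height=height, margin=dict(t=30, l=0, r=0, b=0))
    return fig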

import plotly.express as px

# Add a single root label so every topic hangs off one center node.
topic_df["title"] = "Tech <br> Earnings Calls"
fig = px.sunburst(
    topic_df.head(100),
    path=["title", "short_title"],
    values="prevalence",
    hover_data=["topic_id", "mentions"],
)
fig = procFig(fig, height=500)
fig.show()

Hierarchical Visualization

We can better visualize our thematic data by leveraging Sturdy Statistics’ hierarchical schema. We can display this hierarchy using a sunburst visualization. The inner circle of the sunburst is the title of the plot, and the leaf nodes are the same topics that we displayed in the pie chart above. The middle layer is the topic_group, which Sturdy Statistics automatically extracts in conjunction with the granular topics. The size of each slice is proportional to how often that topic or topic group shows up in the dataset.

import plotly.express as px

topic_df["title"] = "Tech <br> Earnings Calls"
fig = px.sunburst(
    topic_df,
    # Three-level hierarchy: root title -> topic group -> granular topic.
    path=["title", "topic_group_short_title", "short_title"],
    values="prevalence",
    hover_data=["topic_id", "mentions"],
)
fig = procFig(fig, height=500)
fig.show()

What’s Next?

So far, we have explored the semantic structure Sturdy Statistics applies to unstructured data via topics and topic groups. In the following sections, we will explore how to leverage this structure to drive both high-level insights and granular analyses.

Section   Description
Part II   Topic-based Granular Retrieval
Part III  Semantic Analysis in SQL
Part IV   Statistically Tuned Search
Part V    Custom Index Creation