Semantic Segment Explorer

Welcome to the Semantic Segment Explorer!

This is an experimental tool that takes a source text and breaks it down into many overlapping segments (phrases and parts of sentences). Each unique segment is converted into a numerical representation (embedding) with minishlab/potion-retrieval-32M running on transformers.js. You can then query the indexed segments to find the ones most semantically similar to your query. My main motivation for creating this app was to play with different chunking/segmentation strategies. It's interesting to see that the most similar segments vary quite a lot in length. Go ahead and see for yourself! If you're into Guerilla Semantic Search or generally interested in this topic, check out SemanticFinder or my other semantic search apps and demos on GitHub.
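The query step described above can be sketched as a ranking of segments by vector similarity. This is a minimal Python illustration, not the app's code (the app runs in the browser on transformers.js). It assumes cosine similarity as the scoring function, which the text does not state, and the three-dimensional vectors in `toy_index` and `toy_query` are made-up placeholders standing in for real embeddings. The names `cosine_similarity` and `rank_segments` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(
    query_vector: list[float],
    index: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Return the top_k (score, segment) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vector, vector), segment)
        for segment, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Placeholder vectors, NOT real model output (assumption for illustration).
toy_index = {
    "The": [0.1, 0.1, 0.9],
    "quick brown fox.": [0.9, 0.1, 0.0],
    "brown fox.": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

for score, segment in rank_segments(toy_query, toy_index):
    print(f"{score:.3f}  {segment}")
```

With real embeddings, segments of very different lengths can end up next to each other in this ranking, which is the effect the tool is meant to let you explore.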

How Segmentation Works (Example)

Imagine your input text is: "The quick brown fox."

1. Without Sentence Boundaries:

Segments are generated by taking all possible contiguous word combinations:

The

The quick

The quick brown

The quick brown fox.

quick

quick brown

quick brown fox.

brown

brown fox.

fox.

(...and so on for longer texts. Duplicate segments are removed before embedding.)
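The enumeration above can be reproduced with two nested loops over start and end positions. This is a minimal Python sketch rather than the app's own implementation. It assumes words are separated by whitespace and that punctuation stays attached to the neighbouring word, as in "fox." above. The function name `contiguous_segments` is hypothetical.

```python
def contiguous_segments(text: str) -> list[str]:
    """Return every contiguous run of words in `text`, without duplicates,
    in the order in which each segment is first produced."""
    words = text.split()  # assumption: plain whitespace tokenization
    seen: set[str] = set()
    segments: list[str] = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            segment = " ".join(words[start:end])
            if segment not in seen:  # drop duplicates before embedding
                seen.add(segment)
                segments.append(segment)
    return segments


segments = contiguous_segments("The quick brown fox.")
print(len(segments))  # 10
for segment in segments:
    print(segment)
```

For the four-word example this yields the ten segments listed above, in the same order.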

2. With Sentence Boundaries (Default):

If your text is: "The cat sat. The dog ran."

First, it's split into sentences:

Sentence 1: "The cat sat."

Sentence 2: "The dog ran."

Then, segments are generated within each sentence independently, like the example above:

From Sentence 1:

The

The cat

The cat sat.

cat

cat sat.

sat.

From Sentence 2:

The

The dog

The dog ran.

dog

dog ran.

ran.

(All these segments are then combined. Duplicate segments across the entire collection are removed before embedding.)
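The sentence-bounded variant can be sketched the same way: split first, then enumerate inside each sentence, and deduplicate over the combined result. This is an illustrative Python sketch under stated assumptions, not the app's code. In particular, the regular expression in `split_sentences` is a simplifying assumption (a sentence ends at ".", "!" or "?" followed by whitespace), and the app may detect sentences differently. Both function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segments_within_sentences(text: str) -> list[str]:
    """Generate contiguous word segments per sentence, then deduplicate
    across the whole collection (first occurrence wins)."""
    seen: set[str] = set()
    segments: list[str] = []
    for sentence in split_sentences(text):
        words = sentence.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                segment = " ".join(words[start:end])
                if segment not in seen:
                    seen.add(segment)
                    segments.append(segment)
    return segments


segments = segments_within_sentences("The cat sat. The dog ran.")
print(len(segments))  # 11
```

Each sentence contributes six segments, but "The" occurs in both, so 11 unique segments remain. No segment crosses the sentence boundary, so "sat. The" is never generated.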

Segment Generation Complexity

With Sentence Boundaries (default): The text is first split into S sentences, and segments are generated within each sentence. If N_i is the number of words in sentence i, the number of segments is at most the sum of N_i * (N_i + 1) / 2 over all S sentences (duplicates within or across sentences are removed, so the actual count may be lower). With N the total number of words, this is generally far below O(N²): for sentences of bounded length it grows roughly linearly with N.

Without Sentence Boundaries: The number of potential segments grows quadratically with the total number of words (N) in your source text: N * (N + 1) / 2 before deduplication. After removing duplicates, the actual count may be lower but can still be substantial. The number of combinations generated is O(N²).
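To make the difference concrete, the two counts can be computed directly. This is a small Python sketch of the arithmetic only. The function names are hypothetical, the results are upper bounds before deduplication, and the 50 sentences of 20 words each are an assumed example text.

```python
def segment_count(n_words: int) -> int:
    """Number of contiguous word segments in a run of n_words words."""
    return n_words * (n_words + 1) // 2


def segment_count_by_sentence(sentence_lengths: list[int]) -> int:
    """Sum of the per-sentence counts when segments never cross sentences."""
    return sum(segment_count(n) for n in sentence_lengths)


# 1000 words treated as one unbroken run:
print(segment_count(1000))  # 500500

# The same 1000 words as 50 sentences of 20 words each (assumed lengths):
print(segment_count_by_sentence([20] * 50))  # 10500
```

In this example, splitting by sentence reduces the number of segments to embed by a factor of almost 50.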

For more details or to contribute, visit the GitHub repository.

1. Input Source Text

Enter the main body of text. For texts above 1000 words (about 500k potential segments without sentence boundaries), it is best to use sentence boundaries; otherwise the app will likely crash.

Use Sentence Boundaries for Segmentation

2. Query Your Indexed Text

3. Results