Semantic Segment Explorer

Welcome to the Semantic Segment Explorer!

This experimental tool takes a source text and breaks it down into many overlapping segments (phrases and parts of sentences). Each unique segment is converted into a numerical representation (an embedding) with minishlab/potion-retrieval-32M running on transformers.js. You can then query the indexed segments to find those most semantically similar to your query. My main motivation for creating this app was to play with different chunking/segmentation strategies. It's interesting to see that the most similar segments vary quite a lot in length; go ahead and see for yourself! If you're into Guerilla Semantic Search or generally interested in this topic, check out SemanticFinder or my other semantic search apps and demos on GitHub.
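The query step can be pictured as ranking stored segment vectors by their similarity to the query vector. The sketch below is only an illustration, not the app's code: it assumes cosine similarity as the metric (the page does not say which one is used), and the three-dimensional vectors are hand-made stand-ins for real embeddings. The names cosine_similarity and top_matches are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, indexed, k=3):
    """Return the k (score, segment) pairs most similar to query_vec.

    `indexed` is a list of (segment, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), seg) for seg, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made vectors standing in for real embeddings (assumption for illustration).
index = [
    ("quick brown fox", [0.9, 0.1, 0.0]),
    ("The", [0.1, 0.1, 0.9]),
    ("brown fox.", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]

for score, segment in top_matches(query, index, k=2):
    print(f"{score:.3f}  {segment}")
```

Because every segment is scored independently, short and long segments compete in the same ranking, which is why the best matches can differ so much in length.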

How Segmentation Works (Example)

Imagine your input text is: "The quick brown fox."

1. Without Sentence Boundaries:

Segments are generated by taking all possible contiguous word combinations:

The
The quick
The quick brown
The quick brown fox.
quick
quick brown
quick brown fox.
brown
brown fox.
fox.

(...and so on, for longer texts. Duplicate segments are removed before embedding.)
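The enumeration above can be sketched in a few lines. This is a minimal illustration, assuming words are split on whitespace so that punctuation stays attached (as in "fox."); the function names are hypothetical and the app's own implementation may differ.

```python
def contiguous_segments(text: str) -> list[str]:
    """Return every contiguous run of whitespace-separated words in `text`."""
    words = text.split()  # punctuation stays attached, e.g. "fox."
    segments = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            segments.append(" ".join(words[start:end]))
    return segments


def unique_in_order(items: list[str]) -> list[str]:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


segments = unique_in_order(contiguous_segments("The quick brown fox."))
for segment in segments:
    print(segment)
```

For the four-word example this prints the ten segments listed above, in the same order.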

2. With Sentence Boundaries (Default):

If your text is: "The cat sat. The dog ran."

First, it's split into sentences:

Sentence 1: "The cat sat."
Sentence 2: "The dog ran."

Then, segments are generated within each sentence independently, like the example above:

From Sentence 1:
The
The cat
The cat sat.
cat
cat sat.
sat.

From Sentence 2:
The
The dog
The dog ran.
dog
dog ran.
ran.

(All these segments are then combined. Duplicate segments across the entire collection are removed before embedding.)
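A sentence-aware variant might look like the sketch below. The regular-expression splitter is a naive stand-in (it breaks after ".", "!" or "?" followed by whitespace) and is an assumption; the page does not say how the app detects sentence boundaries. The function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segments_within_sentences(text: str) -> list[str]:
    """Generate contiguous word segments per sentence, deduplicated overall."""
    seen: dict[str, None] = {}  # dict keys keep first-seen order
    for sentence in split_sentences(text):
        words = sentence.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                seen.setdefault(" ".join(words[start:end]), None)
    return list(seen)


result = segments_within_sentences("The cat sat. The dog ran.")
for segment in result:
    print(segment)
```

For the two-sentence example this yields 11 unique segments rather than 12, because "The" occurs in both sentences and is kept only once. No segment spans the sentence boundary, so "sat. The" never appears.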

Segment Generation Complexity
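
As a rough guide to how quickly the number of segments grows: a run of n words contains n(n+1)/2 contiguous segments, so the count grows quadratically with text length. With sentence boundaries, the same formula applies to each sentence separately and the counts are summed. The sketch below assumes whitespace-split words and counts segments before duplicates are removed, so the figures are upper bounds; the function names are hypothetical.

```python
def segment_count(n_words: int) -> int:
    """Number of contiguous word runs in a sequence of n_words words."""
    return n_words * (n_words + 1) // 2


def segment_count_by_sentence(sentence_lengths: list[int]) -> int:
    """Total segments when each sentence is segmented independently."""
    return sum(segment_count(n) for n in sentence_lengths)


print(segment_count(4))                      # 10, the "quick brown fox" example
print(segment_count(100))                    # 5050, one unbroken 100-word text
print(segment_count_by_sentence([20] * 5))   # 1050, same 100 words as 5 sentences
```

Splitting the same 100 words into five 20-word sentences cuts the number of segments to embed by almost a factor of five, and the saving grows with text length.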

For more details or to contribute, visit the GitHub repository.

1. Input Source Text

Enter the main body of text. For texts above 1000 words (about 500k segments before deduplication, since 1000 × 1001 / 2 = 500,500), it is best to use sentence boundaries; otherwise the app will crash.
