1. Input Source Text
Enter the main body of text. For texts above about 1000 words (roughly 600k unique segments), it is best to use sentence boundaries; otherwise the app will crash.
This is an experimental tool that takes a source text and breaks it down into many overlapping segments (phrases and parts of sentences). Each unique segment is converted into a numerical representation (an embedding) with minishlab/potion-retrieval-32M running on transformers.js. You can then query the indexed segments to find those most semantically similar to your query.
My main motivation for creating this app was to play with different chunking/segmentation strategies. It's interesting to see that the most similar segments vary quite a lot in length. Go ahead and see for yourself! Check out SemanticFinder or see my other semantic search apps and demos if you're into Guerilla Semantic Search or generally interested in this topic: GitHub.
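The querying step described above can be pictured with a small Python sketch. It is only an illustration: it assumes that similarity is measured with cosine similarity (the app's actual metric is not stated here), uses made-up 3-dimensional vectors in place of real model embeddings, and the names `cosine_similarity`, `top_k`, and `index` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k segments whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), seg) for seg, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(seg, score) for score, seg in scored[:k]]


# Hypothetical toy embeddings; a real embedding model produces much larger vectors.
index = {
    "quick brown fox": [0.9, 0.1, 0.0],
    "the cat sat": [0.1, 0.9, 0.1],
    "brown fox": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedded query text

for segment, score in top_k(query, index, k=2):
    print(f"{score:.3f}  {segment}")
```

Because every segment is scored independently of its length, short and long segments compete in the same ranking, which is what makes the length variation among top results visible.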
Imagine your input text is: "The quick brown fox."
1. Without Sentence Boundaries:
Segments are generated by taking all possible contiguous word combinations:
(...and so on for longer texts. Duplicate segments are removed before embedding.)
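As a rough illustration of this strategy, the following Python sketch enumerates every contiguous word span of a text and drops duplicates. It assumes plain whitespace tokenization with punctuation left attached to words; the app itself runs in the browser and may tokenize differently, and the function names are hypothetical.

```python
def contiguous_segments(text: str) -> list[str]:
    """Return every unique contiguous word span of `text`, in first-seen order."""
    words = text.split()  # assumption: whitespace tokenization, punctuation kept
    seen: dict[str, None] = {}  # dicts keep insertion order, so this acts as an ordered set
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            seen.setdefault(" ".join(words[start:end]), None)
    return list(seen)


def max_segment_count(n_words: int) -> int:
    """Upper bound on the number of spans before deduplication: n * (n + 1) / 2."""
    return n_words * (n_words + 1) // 2


segments = contiguous_segments("The quick brown fox.")
for segment in segments:
    print(segment)

print(len(segments), max_segment_count(1000))
```

Under these assumptions the four-word example yields 10 segments, and the count grows quadratically with length: 1000 words give up to 500,500 spans before deduplication. The exact figure in the app depends on how it tokenizes the text, but this growth is why long inputs become expensive without sentence boundaries.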
2. With Sentence Boundaries (Default):
If your text is: "The cat sat. The dog ran."
First, it's split into sentences:
Then, segments are generated within each sentence independently, like the example above:
(All these segments are then combined. Duplicate segments across the entire collection are removed before embedding.)
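The sentence-boundary variant can be sketched the same way. This Python example assumes a naive splitter that breaks after `.`, `!` or `?` followed by whitespace, and whitespace tokenization within each sentence; the app's real sentence detection may differ, and the function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segments_within_sentences(text: str) -> list[str]:
    """Unique contiguous word spans that never cross a sentence boundary."""
    seen: dict[str, None] = {}  # ordered set shared across all sentences
    for sentence in split_sentences(text):
        words = sentence.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                seen.setdefault(" ".join(words[start:end]), None)
    return list(seen)


text = "The cat sat. The dog ran."
print(split_sentences(text))
for segment in segments_within_sentences(text):
    print(segment)
```

With these assumptions the example produces 11 unique segments (the repeated "The" is kept once), compared with 20 when spans may cross the sentence boundary. The saving is small here but grows quickly, because the span count is quadratic in the number of words per sentence rather than in the length of the whole text.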
For more details or to contribute, visit the GitHub repository.