Semantic Hexbins

Info

Demo app for a paper that has not yet been published.

Paper Authors: anonymized for peer review

A lightweight frontend app showcasing the use of semantic similarity for geospatial applications such as geosocial media.
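As a rough illustration of what "semantic similarity" means here, the sketch below ranks a few posts against a query by the cosine similarity of their embedding vectors. It is not the app's actual implementation: the three-dimensional vectors, the post identifiers, and the function names are all hypothetical, and a real embedding model produces vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_posts(query_vec, post_vecs):
    """Return post ids sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), post_id)
              for post_id, vec in post_vecs.items()]
    scored.sort(reverse=True)
    return [post_id for _, post_id in scored]


# Hypothetical toy embeddings (assumed values, for illustration only).
query = [0.9, 0.1, 0.0]
posts = {
    "post_a": [0.8, 0.2, 0.1],
    "post_b": [0.0, 0.1, 0.9],
    "post_c": [0.5, 0.5, 0.0],
}

ranking = rank_posts(query, posts)
print(ranking)
```

In a geospatial setting, each post would also carry a location, so scores like these could be aggregated per map cell. How the app itself aggregates them is described in the paper, not here.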

More details are available in the paper or on GitHub.

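Supported query languages (116)

The languages that multilingual-e5-small (and its base model, microsoft/Multilingual-MiniLM-L12-H384) was trained on. Source here. Each entry gives the language code, the language name, and, in parentheses, the amount of training data. The more training data a language has, the better queries in that language should work.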

af Afrikaans (305M)
am Amharic (133M)
ar Arabic (5.4G)
as Assamese (7.6M)
az Azerbaijani (1.3G)
be Belarusian (692M)
bg Bulgarian (9.3G)
bn Bengali (860M)
bn_rom Bengali Romanized (164M)
br Breton (21M)
bs Bosnian (18M)
ca Catalan (2.4G)
cs Czech (4.4G)
cy Welsh (179M)
da Danish (12G)
de German (18G)
el Greek (7.4G)
en English (82G)
eo Esperanto (250M)
es Spanish (14G)
et Estonian (1.7G)
eu Basque (488M)
fa Persian (20G)
ff Fulah (3.1M)
fi Finnish (15G)
fr French (14G)
fy Frisian (38M)
ga Irish (108M)
gd Scottish Gaelic (22M)
gl Galician (708M)
gn Guarani (1.5M)
gu Gujarati (242M)
ha Hausa (61M)
he Hebrew (6.1G)
hi Hindi (2.5G)
hi_rom Hindi Romanized (129M)
hr Croatian (5.7G)
ht Haitian (9.1M)
hu Hungarian (15G)
hy Armenian (776M)
id Indonesian (36G)
ig Igbo (6.6M)
is Icelandic (779M)
it Italian (7.8G)
ja Japanese (15G)
jv Javanese (37M)
ka Georgian (1.1G)
kk Kazakh (889M)
km Khmer (153M)
kn Kannada (360M)
ko Korean (14G)
ku Kurdish (90M)
ky Kyrgyz (173M)
la Latin (609M)
lg Ganda (7.3M)
li Limburgish (2.2M)
ln Lingala (2.3M)
lo Lao (63M)
lt Lithuanian (3.4G)
lv Latvian (2.1G)
mg Malagasy (29M)
mk Macedonian (706M)
ml Malayalam (831M)
mn Mongolian (397M)
mr Marathi (334M)
ms Malay (2.1G)
my Burmese (46M)
my_zaw Burmese (Zawgyi) (178M)
ne Nepali (393M)
nl Dutch (7.9G)
no Norwegian (13G)
ns Northern Sotho (1.8M)
om Oromo (11M)
or Oriya (56M)
pa Punjabi (90M)
pl Polish (12G)
ps Pashto (107M)
pt Portuguese (13G)
qu Quechua (1.5M)
rm Romansh (4.8M)
ro Romanian (16G)
ru Russian (46G)
sa Sanskrit (44M)
si Sinhala (452M)
sc Sardinian (143K)
sd Sindhi (67M)
sk Slovak (6.1G)
sl Slovenian (2.8G)
so Somali (78M)
sq Albanian (1.3G)
sr Serbian (1.5G)
ss Swati (86K)
su Sundanese (15M)
sv Swedish (21G)
sw Swahili (332M)
ta Tamil (1.3G)
ta_rom Tamil Romanized (68M)
te Telugu (536M)
te_rom Telugu Romanized (79M)
th Thai (8.7G)
tl Tagalog (701M)
tn Tswana (8.0M)
tr Turkish (5.4G)
ug Uyghur (46M)
uk Ukrainian (14G)
ur Urdu (884M)
ur_rom Urdu Romanized (141M)
uz Uzbek (155M)
vi Vietnamese (28G)
wo Wolof (3.6M)
xh Xhosa (25M)
yi Yiddish (51M)
yo Yoruba (1.1M)
zh-Hans Chinese (Simplified) (14G)
zh-Hant Chinese (Traditional) (5.3G)
zu Zulu (4.3M)
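Since the amount of training data is the only hint given about how well a language should work, it can help to sort the entries by it. The following is a minimal sketch that parses lines in the format used above and orders them from most to least training data. It assumes that K, M and G are decimal multipliers of one common unit (the list does not state the unit), and the names `parse_entry` and `sort_by_training_data` are hypothetical helpers, not part of the app.

```python
import re

# Assumption: K, M and G scale the same (unstated) unit by 10^3, 10^6 and 10^9.
MULTIPLIERS = {"K": 1e3, "M": 1e6, "G": 1e9}

# Matches "<code> <name> (<number><suffix>)"; the name may itself contain
# parentheses, as in "Burmese (Zawgyi)".
ENTRY = re.compile(r"^(\S+)\s+(.+)\s+\((\d+(?:\.\d+)?)([KMG])\)$")


def parse_entry(line):
    """Split one list entry into (code, name, training data amount)."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {line!r}")
    code, name, number, suffix = match.groups()
    return code, name, float(number) * MULTIPLIERS[suffix]


def sort_by_training_data(lines):
    """Return parsed entries ordered from most to least training data."""
    return sorted((parse_entry(line) for line in lines),
                  key=lambda entry: entry[2], reverse=True)


# A few entries copied from the list above.
sample = [
    "ss Swati (86K)",
    "en English (82G)",
    "my_zaw Burmese (Zawgyi) (178M)",
    "af Afrikaans (305M)",
]

for code, name, amount in sort_by_training_data(sample):
    print(f"{code:8} {name:20} {amount:.0f}")
```

Entries near the bottom of such a ranking (for example Swati or Sardinian) are the ones where query results are most likely to be weak.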





Lower bounds (default: median)
Upper bounds (default: max)