Semantic News Mapper

Info

Geospatial Semantic Search demo app, based on the paper: Embedding-Based Multilingual Semantic Search for Geo-Textual Data in Urban Studies

App Author: Dominik Weckmüller - you can hire me!

A lightweight frontend app showcasing semantic similarity search over geo-referenced text such as newspaper articles.
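As a rough illustration of the idea, the sketch below ranks geo-located articles by cosine similarity between a query embedding and each article embedding. It is not the app's actual code: the article titles, coordinates, three-dimensional vectors, and the helper names `cosine_similarity` and `rank_articles` are all made up for illustration. A real setup would obtain the vectors from an embedding model such as multilingual-e5-small.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical articles: (title, latitude, longitude, embedding).
# The embeddings are toy values, not model output.
ARTICLES = [
    ("New bike lanes open downtown", 52.52, 13.40, [0.9, 0.1, 0.0]),
    ("Local bakery wins award", 48.14, 11.58, [0.1, 0.9, 0.1]),
    ("City expands cycling network", 50.94, 6.96, [0.8, 0.2, 0.1]),
]


def rank_articles(query_embedding, articles):
    """Return (score, title, lat, lon) tuples, most similar first."""
    scored = [
        (cosine_similarity(query_embedding, emb), title, lat, lon)
        for title, lat, lon, emb in articles
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


query = [1.0, 0.0, 0.0]  # stand-in for an embedded search query
ranked = rank_articles(query, ARTICLES)
for score, title, lat, lon in ranked:
    print(f"{score:.3f}  {title}  ({lat}, {lon})")
```

Because each result keeps its coordinates, the scores can be drawn directly on a map, for example by coloring each point according to its similarity to the query.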

More details in the paper or on GitHub

Supported query languages (116)

Languages that multilingual-e5-small (and its base model, microsoft/Multilingual-MiniLM-L12-H384) was trained on. Source here. The more training data a language has, the better it should work.

af Afrikaans (305M)
am Amharic (133M)
ar Arabic (5.4G)
as Assamese (7.6M)
az Azerbaijani (1.3G)
be Belarusian (692M)
bg Bulgarian (9.3G)
bn Bengali (860M)
bn_rom Bengali Romanized (164M)
br Breton (21M)
bs Bosnian (18M)
ca Catalan (2.4G)
cs Czech (4.4G)
cy Welsh (179M)
da Danish (12G)
de German (18G)
el Greek (7.4G)
en English (82G)
eo Esperanto (250M)
es Spanish (14G)
et Estonian (1.7G)
eu Basque (488M)
fa Persian (20G)
ff Fulah (3.1M)
fi Finnish (15G)
fr French (14G)
fy Frisian (38M)
ga Irish (108M)
gd Scottish Gaelic (22M)
gl Galician (708M)
gn Guarani (1.5M)
gu Gujarati (242M)
ha Hausa (61M)
he Hebrew (6.1G)
hi Hindi (2.5G)
hi_rom Hindi Romanized (129M)
hr Croatian (5.7G)
ht Haitian (9.1M)
hu Hungarian (15G)
hy Armenian (776M)
id Indonesian (36G)
ig Igbo (6.6M)
is Icelandic (779M)
it Italian (7.8G)
ja Japanese (15G)
jv Javanese (37M)
ka Georgian (1.1G)
kk Kazakh (889M)
km Khmer (153M)
kn Kannada (360M)
ko Korean (14G)
ku Kurdish (90M)
ky Kyrgyz (173M)
la Latin (609M)
lg Ganda (7.3M)
li Limburgish (2.2M)
ln Lingala (2.3M)
lo Lao (63M)
lt Lithuanian (3.4G)
lv Latvian (2.1G)
mg Malagasy (29M)
mk Macedonian (706M)
ml Malayalam (831M)
mn Mongolian (397M)
mr Marathi (334M)
ms Malay (2.1G)
my Burmese (46M)
my_zaw Burmese (Zawgyi) (178M)
ne Nepali (393M)
nl Dutch (7.9G)
no Norwegian (13G)
ns Northern Sotho (1.8M)
om Oromo (11M)
or Oriya (56M)
pa Punjabi (90M)
pl Polish (12G)
ps Pashto (107M)
pt Portuguese (13G)
qu Quechua (1.5M)
rm Romansh (4.8M)
ro Romanian (16G)
ru Russian (46G)
sa Sanskrit (44M)
si Sinhala (452M)
sc Sardinian (143K)
sd Sindhi (67M)
sk Slovak (6.1G)
sl Slovenian (2.8G)
so Somali (78M)
sq Albanian (1.3G)
sr Serbian (1.5G)
ss Swati (86K)
su Sundanese (15M)
sv Swedish (21G)
sw Swahili (332M)
ta Tamil (1.3G)
ta_rom Tamil Romanized (68M)
te Telugu (536M)
te_rom Telugu Romanized (79M)
th Thai (8.7G)
tl Tagalog (701M)
tn Tswana (8.0M)
tr Turkish (5.4G)
ug Uyghur (46M)
uk Ukrainian (14G)
ur Urdu (884M)
ur_rom Urdu Romanized (141M)
uz Uzbek (155M)
vi Vietnamese (28G)
wo Wolof (3.6M)
xh Xhosa (25M)
yi Yiddish (51M)
yo Yoruba (1.1M)
zh-Hans Chinese (Simplified) (14G)
zh-Hant Chinese (Traditional) (5.3G)
zu Zulu (4.3M)
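To compare languages by the amount of training data, the size labels in the list above can be turned into numbers. The sketch below assumes that the suffixes K, M, and G stand for thousand, million, and billion. The list does not state the unit, so only the relative order is meaningful here. The names `UNIT_FACTORS`, `parse_size`, and `SAMPLE` are hypothetical.

```python
# Assumed meaning of the suffixes; the unit itself is not given in the list.
UNIT_FACTORS = {"K": 1e3, "M": 1e6, "G": 1e9}


def parse_size(text):
    """Convert a label such as '5.4G' or '860M' into a plain number."""
    return float(text[:-1]) * UNIT_FACTORS[text[-1]]


# A few entries copied from the list above.
SAMPLE = {"en": "82G", "de": "18G", "cy": "179M", "ss": "86K"}

# Language codes ordered from most to least training data.
by_size = sorted(SAMPLE, key=lambda code: parse_size(SAMPLE[code]), reverse=True)
print(by_size)
```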

Lower bounds (default: median)
Upper bounds (default: max)
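One plausible reading of these two controls is that they restrict which results are shown by similarity score, with the median score as the default lower bound and the maximum score as the default upper bound. The sketch below follows that reading. It is an assumption, not something the text confirms, and the function name `filter_by_bounds` and the example scores are hypothetical.

```python
from statistics import median


def filter_by_bounds(scores, lower=None, upper=None):
    """Keep scores within [lower, upper].

    Assumed defaults: lower falls back to the median of the scores,
    upper falls back to their maximum.
    """
    if not scores:
        return []
    lo = median(scores) if lower is None else lower
    hi = max(scores) if upper is None else upper
    return [s for s in scores if lo <= s <= hi]


scores = [0.12, 0.35, 0.48, 0.77, 0.91]  # made-up similarity scores
kept = filter_by_bounds(scores)
print(kept)
```

With the defaults, roughly the upper half of the results remains visible. Passing explicit values, for example `filter_by_bounds(scores, lower=0.3, upper=0.5)`, narrows the selection to a custom band.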