Semantic hexagonal on-the-fly binning metrics for city-scale georeferenced social media data

Info

Demo app for the paper: Embedding-Based Multilingual Semantic Search for Geo-Textual Data in Urban Studies

Paper Authors: Dominik Weckmüller, Alexander Dunkel, Dirk Burghardt

A lightweight frontend app showcasing the use of semantic similarity in geospatial applications such as the analysis of geosocial media.
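As a rough illustration of what "semantic similarity" means here, the sketch below scores a few posts against a query by cosine similarity of their embedding vectors. It is a minimal example, not the app's actual code: the function name `cosine_similarity`, the post IDs, and the tiny 3-dimensional vectors are all hypothetical stand-ins for real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; a real embedding model yields far longer vectors.
query_embedding = [1.0, 0.0, 0.0]
post_embeddings = {
    "post_a": [2.0, 0.0, 0.0],  # same direction as the query
    "post_b": [0.0, 3.0, 0.0],  # orthogonal to the query
    "post_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# One similarity score per post, later aggregated per hexbin.
scores = {
    post_id: cosine_similarity(query_embedding, emb)
    for post_id, emb in post_embeddings.items()
}
```

Because cosine similarity ignores vector length, `post_a` scores the same as an exact copy of the query would, while `post_b` scores zero.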

More details are available in the paper or on GitHub.

Supported query languages (116)

Languages that multilingual-e5-small and its base model, microsoft/Multilingual-MiniLM-L12-H384, were trained on. Source here. The more training data available for a language (amount shown in parentheses), the better queries in that language should work.

af Afrikaans (305M)
am Amharic (133M)
ar Arabic (5.4G)
as Assamese (7.6M)
az Azerbaijani (1.3G)
be Belarusian (692M)
bg Bulgarian (9.3G)
bn Bengali (860M)
bn_rom Bengali Romanized (164M)
br Breton (21M)
bs Bosnian (18M)
ca Catalan (2.4G)
cs Czech (4.4G)
cy Welsh (179M)
da Danish (12G)
de German (18G)
el Greek (7.4G)
en English (82G)
eo Esperanto (250M)
es Spanish (14G)
et Estonian (1.7G)
eu Basque (488M)
fa Persian (20G)
ff Fulah (3.1M)
fi Finnish (15G)
fr French (14G)
fy Frisian (38M)
ga Irish (108M)
gd Scottish Gaelic (22M)
gl Galician (708M)
gn Guarani (1.5M)
gu Gujarati (242M)
ha Hausa (61M)
he Hebrew (6.1G)
hi Hindi (2.5G)
hi_rom Hindi Romanized (129M)
hr Croatian (5.7G)
ht Haitian (9.1M)
hu Hungarian (15G)
hy Armenian (776M)
id Indonesian (36G)
ig Igbo (6.6M)
is Icelandic (779M)
it Italian (7.8G)
ja Japanese (15G)
jv Javanese (37M)
ka Georgian (1.1G)
kk Kazakh (889M)
km Khmer (153M)
kn Kannada (360M)
ko Korean (14G)
ku Kurdish (90M)
ky Kyrgyz (173M)
la Latin (609M)
lg Ganda (7.3M)
li Limburgish (2.2M)
ln Lingala (2.3M)
lo Lao (63M)
lt Lithuanian (3.4G)
lv Latvian (2.1G)
mg Malagasy (29M)
mk Macedonian (706M)
ml Malayalam (831M)
mn Mongolian (397M)
mr Marathi (334M)
ms Malay (2.1G)
my Burmese (46M)
my_zaw Burmese (Zawgyi) (178M)
ne Nepali (393M)
nl Dutch (7.9G)
no Norwegian (13G)
ns Northern Sotho (1.8M)
om Oromo (11M)
or Oriya (56M)
pa Punjabi (90M)
pl Polish (12G)
ps Pashto (107M)
pt Portuguese (13G)
qu Quechua (1.5M)
rm Romansh (4.8M)
ro Romanian (16G)
ru Russian (46G)
sa Sanskrit (44M)
si Sinhala (452M)
sc Sardinian (143K)
sd Sindhi (67M)
sk Slovak (6.1G)
sl Slovenian (2.8G)
so Somali (78M)
sq Albanian (1.3G)
sr Serbian (1.5G)
ss Swati (86K)
su Sundanese (15M)
sv Swedish (21G)
sw Swahili (332M)
ta Tamil (1.3G)
ta_rom Tamil Romanized (68M)
te Telugu (536M)
te_rom Telugu Romanized (79M)
th Thai (8.7G)
tl Tagalog (701M)
tn Tswana (8.0M)
tr Turkish (5.4G)
ug Uyghur (46M)
uk Ukrainian (14G)
ur Urdu (884M)
ur_rom Urdu Romanized (141M)
uz Uzbek (155M)
vi Vietnamese (28G)
wo Wolof (3.6M)
xh Xhosa (25M)
yi Yiddish (51M)
yo Yoruba (1.1M)
zh-Hans Chinese (Simplified) (14G)
zh-Hant Chinese (Traditional) (5.3G)
zu Zulu (4.3M)

Semantic Hexbins

Input data

Query (semantic similarity)

Visualization Mode

Radius

Color scale extent

Lower bounds (default: median)

Upper bounds (default: max)

Minimum similarity score

Minimum number of scores above the minimum similarity in hexbin

Binning function

Instagram Locations (click hexbin)
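
The controls above could interact roughly as in the following sketch. It is an illustration under stated assumptions, not the app's implementation: it assumes each post has already been assigned to a hexbin and given a similarity score, and it assumes the default color scale bounds are computed over the aggregated hexbin values. The names `aggregate_hexbins`, `color_extent`, and `normalize`, and the hexbin IDs, are hypothetical.

```python
import statistics
from collections import defaultdict


def aggregate_hexbins(scored_points, min_score, min_count, binning=statistics.mean):
    """Aggregate (hexbin_id, score) pairs into one value per hexbin.

    Scores below min_score are dropped first; hexbins left with fewer than
    min_count scores are dropped entirely.
    """
    bins = defaultdict(list)
    for hex_id, score in scored_points:
        if score >= min_score:
            bins[hex_id].append(score)
    return {h: binning(s) for h, s in bins.items() if len(s) >= min_count}


def color_extent(values, lower=None, upper=None):
    """Color scale bounds; assumed defaults are median (lower) and max (upper)."""
    vals = list(values)
    lo = statistics.median(vals) if lower is None else lower
    hi = max(vals) if upper is None else upper
    return lo, hi


def normalize(value, lo, hi):
    """Map a hexbin value to [0, 1] for a color ramp, clamping at the bounds."""
    if hi <= lo:
        return 0.0
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


# Hypothetical scored posts: (hexbin id, similarity to the query).
points = [
    ("h1", 0.9), ("h1", 0.8), ("h1", 0.2),
    ("h2", 0.85),
    ("h3", 0.7), ("h3", 0.9),
    ("h4", 0.75), ("h4", 0.75),
]

hex_values = aggregate_hexbins(points, min_score=0.7, min_count=2)
lo, hi = color_extent(hex_values.values())
colors = {h: normalize(v, lo, hi) for h, v in hex_values.items()}
```

In this toy data, `h2` disappears because only one of its scores passes the similarity threshold. Passing `binning=max` instead of the mean would color each hexbin by its single best-matching post.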