# Mapping the modern world: How S2Vec learns the language of our cities

- Source: Google Research: Blog (web page)
- Published: 2026-03-24 00:00
- AIHOT link: https://aihot.virxact.com/items/cmnwsvdnz004jsltews4kgd3q
- Original link: https://research.google/blog/mapping-the-modern-world-how-s2vec-learns-the-language-of-our-cities

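## AI summary

Google Research has released S2Vec, a self-supervised geospatial framework that uses S2 Geometry to partition the Earth's surface into hierarchical cells, rasterizes features such as buildings and roads into multi-layer images, and learns general-purpose embeddings with a masked autoencoder (MAE). It can predict socioeconomic indicators such as population density and housing prices without manual annotation, and it outperforms image-based baselines on geographic extrapolation tasks, but environmental prediction tasks such as tree cover and elevation still need improvement.

## Main text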

Mapping the modern world: How S2Vec learns the language of our cities

March 24, 2026

Shushman Choudhury, Research Scientist, Google Research

We introduce S2Vec, a self-supervised framework that transforms complex geospatial data into general-purpose embeddings to predict socioeconomic and environmental patterns across the globe.

When we think about artificial intelligence and geography, we often focus on navigation, or getting from point A to point B. However, the built environment — the complex web of roads, buildings, businesses, and infrastructure that defines our world — contains far more information than just coordinates on a map. These features tell a story about socioeconomic health, environmental patterns, and urban development.

Until recently, translating these diverse geospatial features into formats that machine learning (ML) models can understand had been a manual and labor-intensive process. Researchers often had to hand-craft specific indicators for every new problem they wanted to solve. At Google Research, we’ve developed a new way to bridge this gap as part of the Google Earth AI initiative, our collective set of geospatial efforts that transform planetary information into actionable intelligence using foundation models and advanced AI reasoning.

In line with the Earth AI vision, we recently introduced S2Vec, a self-supervised framework designed to learn general-purpose embeddings (i.e., compact, numerical summaries) of the built environment. S2Vec allows AI to understand the character of a neighborhood much like a human does, recognizing patterns in how gas stations, parks, and housing are distributed, and using that knowledge to predict metrics that matter, from population density to environmental impact. In our evaluations, S2Vec demonstrated competitive performance against image-based baselines in socioeconomic prediction tasks, particularly in geographic adaptation (extrapolation), while showing a clear need for improvement in environmental tasks, like tree cover and elevation.

### The challenge: Going beyond points on a map

Geospatial data is notoriously difficult to work with because it is multimodal and varies wildly in scale. A city block might contain hundreds of data points (buildings, cafes, bus stops), while a rural area might have only a few. Standard ML models prefer structured, uniform data, like the grid of pixels in a photo.

To address this challenge, S2Vec uses a two-step process to rasterize the world:

S2 Geometry partitioning: We use the S2 Geometry library to divide the Earth's surface into a hierarchy of cells. This allows us to look at the world at different resolutions, from a whole country down to a few square meters, and look up cells of any resolution for any location very efficiently. Our internally optimized version of the S2 library can seamlessly switch between cell resolutions covering a given location rapidly.

Feature rasterization: Instead of treating buildings or roads as a list of coordinates, we count the types of features within each S2 cell and arrange them into a multi-layered image. If a cell has three coffee shops and one park, those become "colors" in our geospatial image.
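
The first step, hierarchical partitioning, can be illustrated with a toy quadtree over latitude and longitude. This is only a stand-in under stated assumptions: the real S2 Geometry library projects the sphere onto the six faces of a cube and orders cells along a Hilbert curve, and the function names `cell_token` and `parent_token` are hypothetical. What the sketch preserves is the property the text relies on: a coarser cell's identifier is a prefix of every finer cell it contains, so moving between resolutions for a given location is cheap.

```python
def cell_token(lat, lng, level):
    """Return the quadtree path of the cell containing (lat, lng).

    Toy stand-in for an S2 cell ID (an assumption, not the S2 scheme):
    one digit (0-3) per level, so a level-k token has k characters.
    """
    south, north, west, east = -90.0, 90.0, -180.0, 180.0
    digits = []
    for _ in range(level):
        mid_lat = (south + north) / 2
        mid_lng = (west + east) / 2
        quadrant = 0
        if lat >= mid_lat:  # northern half of the current cell
            quadrant += 2
            south = mid_lat
        else:
            north = mid_lat
        if lng >= mid_lng:  # eastern half of the current cell
            quadrant += 1
            west = mid_lng
        else:
            east = mid_lng
        digits.append(str(quadrant))
    return "".join(digits)


def parent_token(token, level):
    """Coarsen a token to the given level by truncating it."""
    if level > len(token):
        raise ValueError("cannot refine a token by truncation")
    return token[:level]


fine = cell_token(40.7, -74.0, 12)
coarse = cell_token(40.7, -74.0, 4)
print(fine, coarse, parent_token(fine, 4) == coarse)
```

Because each level only appends one digit, looking up the level-4 ancestor of a level-12 cell is a string truncation rather than a new spatial query.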

This transformation allows us to treat geographical data of the built environment like a digital photograph that the AI can "see." In turn, this rasterization opens up the vast and mature toolbox of computer vision techniques that have mostly solved the problem of natural image understanding.
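
The feature rasterization step can be sketched as counting features per type in each cell and stacking the counts as image channels. The example below is a minimal illustration, assuming square latitude/longitude cells in place of S2 cells; the function name `rasterize`, the feature type names, and the coordinates are all made up for the example.

```python
def rasterize(features, feature_types, lat_min, lng_min, cell_deg, n_rows, n_cols):
    """Count features per type in each grid cell.

    Returns image[channel][row][col], with one channel per feature type.
    Square lat/lng cells are an assumption standing in for S2 cells.
    """
    channel = {name: i for i, name in enumerate(feature_types)}
    image = [[[0] * n_cols for _ in range(n_rows)] for _ in feature_types]
    for lat, lng, ftype in features:
        if ftype not in channel:
            continue  # ignore feature types we are not tracking
        row = int((lat - lat_min) // cell_deg)
        col = int((lng - lng_min) // cell_deg)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            image[channel[ftype]][row][col] += 1
    return image


# Made-up points: three coffee shops and one park share the cell at row 0, col 0.
features = [
    (0.10, 0.10, "coffee_shop"),
    (0.20, 0.30, "coffee_shop"),
    (0.40, 0.20, "coffee_shop"),
    (0.30, 0.40, "park"),
    (0.70, 0.60, "park"),
]
image = rasterize(features, ["coffee_shop", "park"], 0.0, 0.0, 0.5, 2, 2)
print(image[0])  # coffee_shop channel
print(image[1])  # park channel
```

Each channel plays the role of a color plane in a photograph, which is what lets image models consume the result directly.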

S2Vec rasterizes built-environment features into images and learns embeddings from them.

### Masked autoencoding: Learning without labels

After transforming the built environment into rasterized feature images, S2Vec analyzes them using masked autoencoding (MAE), a robust self-supervised learning technique. While traditional ML relies on hand-crafted labels (e.g., manually tagging regions for income levels or air quality), self-supervision eliminates this bottleneck. Since labeling the entire planet is an impossible task, MAE allows us to unlock global insights without the need for manually crafted labels.

The MAE process systematically shows the model a “patch” of the built environment while hiding (masking) certain parts of it. The model then reconstructs the missing pieces based solely on the surrounding context:

Contextual logic: If the model sees a cluster of high-rise residential buildings and a subway station, it learns to correctly predict that there is likely a grocery store in the masked area.

Scale: By practicing this millions of times across the globe, the model learns the deep, underlying relationships between different urban features.
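
The data side of masked autoencoding can be sketched without any neural network: split a rasterized layer into patches, hide a random subset, and score a reconstruction only on the hidden patches. The code below is an illustration under assumptions, not the S2Vec training code. The 0.75 masking ratio is the default from the original MAE work on natural images (this post does not state the ratio S2Vec uses), the helper names are hypothetical, and the "model" is a placeholder that predicts the mean of the visible values.

```python
import random


def split_into_patches(grid, patch):
    """Split a 2D grid into non-overlapping patch x patch blocks (flattened).

    Assumes both grid dimensions are multiples of `patch`.
    """
    patches = []
    for r in range(0, len(grid), patch):
        for c in range(0, len(grid[0]), patch):
            patches.append(
                [grid[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def choose_masked(num_patches, mask_ratio, rng):
    """Pick which patch indices to hide from the model."""
    num_masked = int(num_patches * mask_ratio)
    return set(rng.sample(range(num_patches), num_masked))


def masked_mse(patches, predictions, masked):
    """Mean squared reconstruction error over the masked patches only."""
    errors = [
        (target - guess) ** 2
        for idx in masked
        for target, guess in zip(patches[idx], predictions[idx])
    ]
    return sum(errors) / len(errors)


# One made-up feature-count layer (4 x 4 cells).
grid = [
    [3, 1, 0, 0],
    [2, 4, 0, 1],
    [0, 0, 5, 2],
    [1, 0, 3, 4],
]
patches = split_into_patches(grid, patch=2)
masked = choose_masked(len(patches), mask_ratio=0.75, rng=random.Random(0))

# Placeholder "model": predict the mean of the visible values everywhere.
visible = [v for i, p in enumerate(patches) if i not in masked for v in p]
baseline = sum(visible) / len(visible)
predictions = [[baseline] * len(p) for p in patches]
print(len(masked), masked_mse(patches, predictions, masked))
```

A trained encoder-decoder would replace the placeholder predictor; the training signal is the same masked reconstruction error, which requires no human labels.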

The output is a general-purpose embedding: a unique mathematical shorthand that captures the signature of a location. These strings of numbers represent a location’s characteristics, creating a foundation that can then be adapted for a range of tasks.

S2Vec captures the “character” of an urban area by dividing regions into grids in which each cell acts as a data point for “built environment” features like buildings and roads.

Next, the MAE learns to "fill in the blanks" of hidden map sections, identifying deep patterns in the built environment. This creates a powerful mathematical "embedding" for any location, allowing us to predict socioeconomic metrics like housing prices and population density with global scale and accuracy.

The S2Vec auto-encoder tags regions, essentially allowing them to be more deeply categorized and analyzed according to factors like socioeconomic data and population density.

Even without being told what a "financial district" or a "suburban residential zone" is, the model can group them together based purely on the spatial relationship of their features.
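
One simple way to see this kind of grouping is to compare embeddings directly. The sketch below uses cosine similarity to find the most similar known location for an unlabeled cell. Cosine similarity is an assumption chosen for illustration (the post does not say which similarity measure was used), and the three-dimensional vectors and location names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings):
    """Return the name of the embedding closest to `query`."""
    return max(embeddings, key=lambda name: cosine_similarity(query, embeddings[name]))


# Made-up embeddings; real ones would have many more dimensions.
embeddings = {
    "downtown_a": [0.9, 0.1, 0.0],
    "suburb_a": [0.1, 0.8, 0.3],
    "suburb_b": [0.2, 0.9, 0.2],
}
query = [0.85, 0.2, 0.05]  # an unlabeled cell
print(most_similar(query, embeddings))
```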

### Evaluation

We compared S2Vec’s performance against several geospatial and image-based embedding approaches, including SATCLIP, GEOCLIP, RS-MaMMUT, Hex2vec, and GeoVeX. The models were evaluated on multiple geospatial regression benchmarks, specifically predicting socioeconomic metrics like US-wide population density and median income, as well as environmental factors including carbon emissions, tree cover, and elevation.

Loss function: Models were trained and tuned using mean squared error (MSE) loss.

Task types: Performance was measured across two settings: random train/test splits (interpolation) and zero-shot geographic adaptation (extrapolation).
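
The difference between the two evaluation settings comes down to how the test set is chosen. The sketch below contrasts a random split with a split that holds out entire regions, and includes the MSE loss. The record layout, region names, and function names are hypothetical and only illustrate the idea; they are not the benchmarks' actual format.

```python
import random


def random_split(records, test_fraction, seed=0):
    """Interpolation: test cells come from the same regions as training cells."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def geographic_split(records, held_out_regions):
    """Extrapolation: whole regions are unseen during training."""
    train = [r for r in records if r["region"] not in held_out_regions]
    test = [r for r in records if r["region"] in held_out_regions]
    return train, test


def mean_squared_error(targets, predictions):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Made-up records: two cells in each of three regions.
records = [
    {"cell": i, "region": region, "target": float(i)}
    for i, region in enumerate(["north", "north", "south", "south", "west", "west"])
]
train, test = geographic_split(records, held_out_regions={"west"})
print(len(train), len(test))
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

With the geographic split, no cell from the held-out region influences training, so the test score reflects how well the embeddings transfer to places the downstream model has never seen.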

Socioeconomic strengths: S2Vec was typically the best individual model for zero-shot geographic adaptation tasks, such as predicting US-wide median income or population density in unseen regions.

Multimodal fusion: Combining S2Vec with image-based embeddings (multimodal fusion) generally outperformed any single modality.
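
The post does not describe how the embeddings are fused. One simple option, shown here purely as an assumption, is to normalize each modality's embedding and concatenate them before fitting the downstream regressor. The function names and vectors are made up.

```python
def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sum(x * x for x in vector) ** 0.5
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fuse(built_env_embedding, image_embedding):
    """Concatenate two per-location embeddings after normalizing each."""
    return l2_normalize(built_env_embedding) + l2_normalize(image_embedding)


fused = fuse([3.0, 4.0], [0.0, 2.0, 0.0])
print(fused)
```

Normalizing first keeps one modality from dominating the fused vector simply because its values are on a larger scale.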

Environmental weakness: While S2Vec was competitive in predicting environmental factors like carbon emissions, the results showed that "built environment" data alone isn't always enough. For these tasks, S2Vec performed best when combined with satellite imagery embeddings, which capture transportation, vegetation, and terrain features that building counts might miss.

R² (coefficient of determination) measures how much of the variation in the dataset the model explains (values typically range from 0 to 1; higher is better). Notably, S2Vec by itself performed just as well as the industry-standard RS-MaMMUT and better than GEOCLIP. As expected, S2Vec combined with RS-MaMMUT performed best.
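
For reference, the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch follows, with made-up targets. On held-out data the value can fall below 0 when a model does worse than always predicting the mean.

```python
def r_squared(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


targets = [1.0, 2.0, 3.0, 4.0]
print(r_squared(targets, targets))    # perfect predictions
print(r_squared(targets, [2.5] * 4))  # always predicting the mean
```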

### Conclusion

S2Vec represents a significant step toward foundational intelligence for geography. By creating a scalable, self-supervised way to represent the built environment, we are moving away from niche, hand-crafted models and toward a more general form of geospatial AI.

The implications of this kind of work are broad. Urban planners could use the insights derived from these embeddings and others like them to better understand how infrastructure changes affect neighborhood health, while environmental researchers can more accurately model the carbon footprint of rapidly growing cities.

Teaching AI to "read" the language of our streets and buildings yields a deeper, data-driven understanding of the world we’ve built. This aligns with our broader Earth AI mission to transform planetary information into actionable intelligence — an objective supported by our existing ecosystem of foundation models, the Population Dynamics Foundation Model (PDFM), and the Remote Sensing Foundations’ RS-MaMMUT VLM model. Together, these tools provide the scale and precision needed to map and manage our impact on the planet.

### Acknowledgments

We’d like to thank the other co-authors on the paper: Iveel Tsogsuren, Chandrakumari Suvarna, Elad Aharoni, Abdul Rahman Kreidieh, Chun-ta Lu, and Neha Arora. We’d also like to thank Pranjal Awasthi for valuable insights.

