wwxr was an experiment in crawling the Web for WebXR and XR content. It used crawl data from Common Crawl, processed on Elastic MapReduce via cc-mrjob, to scrape the web for A-Frame (`<a-scene>`) and `<model-viewer>` scenes.
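To give a sense of the approach, here is a minimal sketch of that kind of job, assuming the `mrjob` and `warcio` libraries; the tag regex and output format are illustrative, not wwxr's exact implementation:

```python
# Illustrative sketch of a cc-mrjob-style scan for XR markup in WARC files.
# Assumes mrjob >= 0.6.3 (for mapper_raw) and warcio; not wwxr's exact code.
import re
from mrjob.job import MRJob
from warcio.archiveiterator import ArchiveIterator

# Match the root elements of A-Frame and <model-viewer> scenes.
XR_TAG_RE = re.compile(rb'<(a-scene|model-viewer)[\s>]', re.IGNORECASE)

class XRTagScanner(MRJob):
    def mapper_raw(self, warc_path, warc_uri):
        # mapper_raw hands each task a local copy of one WARC file.
        with open(warc_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'response':
                    continue
                url = record.rec_headers.get_header('WARC-Target-URI')
                body = record.content_stream().read()
                match = XR_TAG_RE.search(body)
                if match:
                    yield match.group(1).decode('ascii').lower(), url

    def reducer(self, tag, urls):
        # Group the matching URLs under each XR tag.
        yield tag, list(urls)

if __name__ == '__main__':
    XRTagScanner.run()
```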
Crawled data was ingested into a MongoDB instance and made available via a simple Node.js search-and-browse interface.
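As a rough sketch of the ingestion side, assuming `pymongo`; the schema here (a `pages` collection with `url`, `tag`, `title`, and `keywords` fields) is hypothetical, not wwxr's actual one:

```python
# Hypothetical ingestion step for crawled XR pages, using pymongo.
from pymongo import MongoClient, TEXT

client = MongoClient('mongodb://localhost:27017')
pages = client['wwxr']['pages']

# A text index over title/keywords backs keyword search in the browse UI.
pages.create_index([('title', TEXT), ('keywords', TEXT)])

def ingest(record):
    # Upsert by URL so re-crawls update entries instead of duplicating them.
    pages.update_one({'url': record['url']}, {'$set': record}, upsert=True)

ingest({
    'url': 'https://example.com/scene.html',  # example record
    'tag': 'a-scene',
    'title': 'Example WebXR scene',
    'keywords': ['webxr', 'aframe'],
})

# Keyword search over the indexed fields:
for doc in pages.find({'$text': {'$search': 'webxr'}}):
    print(doc['url'])
```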
wwxr was useful as an experiment: it showed the value of having central access to XR content from across the Web, backed by a keyword search index. The Common Crawl data source was too limited to be generally useful, however, since CC only captures a random sample of the Web with each crawl. Future projects ought to consider a live, ongoing crawl of the Web, using something like Apache Nutch.
Ideally, WebXR content could also be published with metadata for easier crawling. See the discussion at immersive-web/proposals#73.
The community also suggested crawling for 3D models via the SERP structured data used by Google, the `:xr-overlay` CSS pseudo-class defined by the WebXR DOM Overlays spec, and JanusVR tags.
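A hedged sketch of how those extra signals could be detected in fetched HTML, assuming BeautifulSoup; the heuristics and signal names are illustrative, not a spec:

```python
# Illustrative detection of community-suggested XR signals in an HTML page.
import json
from bs4 import BeautifulSoup

def xr_signals(html: str) -> set:
    """Return the set of XR-related signals found in an HTML document."""
    signals = set()
    soup = BeautifulSoup(html, 'html.parser')

    # schema.org structured data (what Google's SERP consumes),
    # looking for a 3DModel item in any JSON-LD block.
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        if any(isinstance(i, dict) and i.get('@type') == '3DModel' for i in items):
            signals.add('structured-data-3dmodel')

    # The :xr-overlay pseudo-class from the WebXR DOM Overlays module.
    if ':xr-overlay' in html:
        signals.add('xr-overlay-css')

    # JanusVR rooms embed <FireBoxRoom> markup (sometimes inside comments).
    if 'fireboxroom' in html.lower():
        signals.add('janusvr')

    return signals
```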
See the post "About wwxr" for more info.
The `seed/` directory in this repo contains a Docker container and scripts for downloading and crawling Common Crawl data.
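For orientation, here is a sketch of the first step such scripts typically perform: fetching the list of WARC files for a given crawl. The crawl ID is just an example, and this is not necessarily how the `seed/` scripts are structured:

```python
# Fetch the WARC file listing for one Common Crawl crawl (illustrative).
import gzip
import urllib.request

CRAWL_ID = 'CC-MAIN-2020-05'  # example crawl; substitute any published crawl ID
listing_url = f'https://data.commoncrawl.org/crawl-data/{CRAWL_ID}/warc.paths.gz'

with urllib.request.urlopen(listing_url) as resp:
    warc_paths = gzip.decompress(resp.read()).decode().splitlines()

# Each entry is a path to a ~1 GB WARC file under https://data.commoncrawl.org/
print(len(warc_paths), 'WARC files; first:', warc_paths[0])
```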
The repo also contains Terraform modules for provisioning a base AWS instance and ops scripts for spinning up the Node.js/MongoDB site, though there's nothing particularly special about this part.