wwxr

wwxr was an experiment in crawling the Web for WebXR and XR content. It used crawl data from Common Crawl, processed with Elastic MapReduce via cc-mrjob, to scrape the web for A-Frame (<a-scene>) and <model-viewer> scenes. Crawled data was ingested into a MongoDB instance and made available through a simple Node.js search-and-browse interface.
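For illustration, here is a minimal sketch of the scraping step, using warcio to scan a local WARC file directly rather than the repo's actual cc-mrjob job on Elastic MapReduce; the file name and tag heuristics are assumptions, not the project's code.

```python
# Sketch: scan a Common Crawl WARC file for pages embedding XR content.
# Illustrative only; the repo's real job runs via cc-mrjob on EMR.
import re

from warcio.archiveiterator import ArchiveIterator

# Match the opening tags of A-Frame and model-viewer scenes.
XR_TAG_RE = re.compile(rb"<(a-scene|model-viewer)[\s>]", re.IGNORECASE)

def find_xr_pages(warc_path):
    """Yield (url, tag) pairs for HTML responses containing XR tags."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            match = XR_TAG_RE.search(body)
            if match:
                yield url, match.group(1).decode("ascii")

if __name__ == "__main__":
    # Placeholder file name; any locally downloaded WARC works here.
    for url, tag in find_xr_pages("CC-MAIN-sample.warc.gz"):
        print(tag, url)
```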

wwxr was useful as an experiment: it showed the value of central access to XR content from across the Web, backed by a keyword search index. However, the Common Crawl data source was too limited to be generally useful, since Common Crawl only captures a random sample of the Web with each crawl. Future projects ought to consider running a live, ongoing crawl of the Web using something like Apache Nutch.

Ideally, WebXR content could also be published with metadata for easier crawling. See the discussion at immersive-web/proposals#73.

The community also suggested other crawl targets: the structured data Google uses to surface 3D models in search results (SERP), the :xr-overlay CSS pseudo-class used by the WebXR spec, and JanusVR tags.
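As a sketch of what detecting those extra signals might look like, each one reduces to a pattern match over the page source. The heuristics below are assumptions for illustration, not part of wwxr itself.

```python
# Sketch: heuristics for the community-suggested XR signals.
# Illustrative assumptions, not the project's code.
import json
import re

JSON_LD_RE = re.compile(
    r"<script[^>]+application/ld\+json[^>]*>(.*?)</script>",
    re.IGNORECASE | re.DOTALL,
)

def detect_xr_signals(html):
    """Return the set of XR-related signals found in an HTML page."""
    signals = set()

    # schema.org structured data declaring a 3DModel type, as used by
    # Google to surface 3D content in search results.
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except ValueError:
            continue
        items = data if isinstance(data, list) else [data]
        if any(isinstance(i, dict) and i.get("@type") == "3DModel" for i in items):
            signals.add("structured-data-3dmodel")

    # The :xr-overlay pseudo-class from the WebXR spec, appearing in
    # inline styles or stylesheets.
    if ":xr-overlay" in html:
        signals.add("xr-overlay-css")

    # JanusVR pages typically embed a <FireBoxRoom> scene description.
    if re.search(r"<FireBoxRoom[\s>]", html, re.IGNORECASE):
        signals.add("janusvr")

    return signals
```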

See this post for more info: About wwxr

The seed/ directory in this repo contains a Docker container and scripts for downloading and crawling Common Crawl data.
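Conceptually, the download step just reads a crawl's manifest of WARC paths and fetches them from Common Crawl's public bucket. A hedged sketch, where the crawl ID is a placeholder and the seed/ scripts do the real work inside Docker:

```python
# Sketch: fetch the first WARC file of a Common Crawl crawl.
import gzip

import requests

CRAWL = "CC-MAIN-2021-49"  # placeholder crawl ID
BASE = "https://data.commoncrawl.org"

def first_warc_url(crawl):
    """Read the crawl's warc.paths.gz manifest and return the first WARC URL."""
    resp = requests.get(f"{BASE}/crawl-data/{crawl}/warc.paths.gz", timeout=60)
    resp.raise_for_status()
    paths = gzip.decompress(resp.content).decode("utf-8").splitlines()
    return f"{BASE}/{paths[0]}"

if __name__ == "__main__":
    url = first_warc_url(CRAWL)
    print("Downloading", url)
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open("sample.warc.gz", "wb") as out:
            # Stream to disk in 1 MiB chunks; WARC files are ~1 GB each.
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)
```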

The repo also contains Terraform modules for provisioning a base AWS instance and ops scripts for spinning up the Node.js/MongoDB site, though there's nothing particularly special about this part.
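The search side amounts to a MongoDB text index over the scraped fields. A minimal sketch of that pattern, shown with pymongo for consistency with the other examples even though the actual site is Node.js; the field names and connection string are assumptions:

```python
# Sketch: the ingest/search pattern behind the browse interface.
# Field names and connection string are assumptions.
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
pages = client.wwxr.pages

# A full-text index over title and keywords enables keyword search.
pages.create_index([("title", TEXT), ("keywords", TEXT)])

pages.insert_one({
    "url": "https://example.com/scene.html",
    "tag": "a-scene",
    "title": "Example WebXR scene",
    "keywords": ["webxr", "a-frame"],
})

# Query the text index, ranked by relevance score.
results = pages.find(
    {"$text": {"$search": "webxr"}},
    {"score": {"$meta": "textScore"}, "url": 1, "title": 1},
).sort([("score", {"$meta": "textScore"})])

for doc in results:
    print(doc["url"], doc["title"])
```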

(Demo video: 1476393420667207687-vxtq6_RawlrpvUTd.mp4)

License

MIT
