serverless-chrome gets incomplete source in Lambda
mynameissue opened this issue · comments
Hello,
I scrape the Johns Hopkins University COVID-19 map in a local environment using Python and Selenium to get the number of cases by country, among other figures.
However, when I tried to do the same thing in AWS Lambda, it failed.
The problem is that I can't get the values I want: when I fetch the HTML of the COVID-19 map, there is almost nothing inside the body tag (I have included it at the end).
At first I thought this was because the serverless-chrome in my AWS setup doesn't support WebGL. However, I read issue #108 and enabled WebGL, and the problem still occurs. (I checked whether the browser supports WebGL on this website.)
As far as I can tell, the only difference between the local environment and Lambda is whether a regular Chrome or the serverless-chrome browser is used.
Could anyone help me resolve this, please?
This is the body element that serverless-chrome returned.
<body>
<script src="https://js.arcgis.com/4.19/init.js" data-amd="true"></script>
<script src="assets/amd-loading-3b41833a646bb19c89df9de8fb3f1a27.js" data-amd-loading="true"></script>
<div id="initialLoadingContainer" class="loader-icon-container">
<div class="loader is-active padding-leader-3 padding-trailer-3">
<div class="loader-bars"></div>
</div>
</div>
</body>
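The body above holds only the `initialLoadingContainer` placeholder, which suggests the dashboard's JavaScript had not finished rendering when the source was captured. Below is a minimal sketch for detecting that state from the page source, using only the standard library. The helper name `is_still_loading` is hypothetical, and the check assumes the dashboard removes the element with that id once it has rendered, which is not confirmed here.

```python
from html.parser import HTMLParser


class _IdFinder(HTMLParser):
    """Records whether any start tag carries the target id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def is_still_loading(html, loader_id="initialLoadingContainer"):
    """Return True if the loader placeholder is present in the HTML.

    Assumption: the placeholder disappears after the dashboard renders.
    """
    finder = _IdFinder(loader_id)
    finder.feed(html)
    return finder.found
```

Calling `is_still_loading(browser.page_source)` before parsing would show whether the scrape ran too early or the page never rendered at all.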
This is the code on Lambda.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import os
def lambda_handler(event, context):
    URL = "https://www.arcgis.com/apps/dashboards/85320e2ea5424dfaaa75ae62e5c06e61"
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--hide-scrollbars")
    options.add_argument("--single-process")
    options.add_argument("--ignore-certificate-errors")
    options.add_argument("--window-size=880x996")
    options.add_argument("--no-sandbox")
    options.add_argument("--homedir=/tmp")
    options.binary_location = "/opt/python/bin/headless-chromium"
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("--disable-application-cache")
    options.add_argument("--disable-infobars")
    options.add_argument("--enable-logging")
    options.add_argument("--log-level=0")
    options.add_argument('--blink-settings=imagesEnabled=false')
    options.add_argument('--disable-extensions')
    options.add_argument('--proxy-server="direct://"')
    options.add_argument('--proxy-bypass-list=*')
    options.add_argument('--start-maximized')
    options.add_argument('--ignore-gpu-blacklist')
    options.add_argument('--enable-webgl')
    options.add_argument('--disable-web-security')
    options.add_argument('--use-gl=osmesa')
    options.add_argument('--data-path=/tmp/data-path')
    options.add_argument('--disk-cache-dir=/tmp/cache-dir')
    browser = webdriver.Chrome(
        "/opt/python/bin/chromedriver",
        options=options
    )
    time.sleep(10)
    browser.get(URL)
    time.sleep(60)
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(html)
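The handler above relies on a fixed `time.sleep(60)` before reading `page_source`. As a hedged alternative for narrowing the problem down, here is a small polling sketch that stops as soon as a condition holds or a timeout passes. The `wait_until` helper is hypothetical and not part of Selenium; Selenium's own `WebDriverWait` serves a similar purpose. Whether the loader ever disappears under serverless-chrome is exactly what this would help test, so it is a diagnostic aid rather than a fix.

```python
import time


def wait_until(condition, timeout=60.0, interval=1.0):
    """Poll condition() until it returns a truthy value or timeout elapses.

    Returns True if the condition was met, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage inside the handler, after browser.get(URL):
#   rendered = wait_until(
#       lambda: "initialLoadingContainer" not in browser.page_source,
#       timeout=60,
#   )
#   print("rendered:", rendered)
```

If this returns False in Lambda but True locally, the page is loading but never finishes rendering in serverless-chrome, which points away from a simple timing issue.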
Same problem, any solution?
Any updates here?