skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.

Home Page: https://docs.skrape.it

[BUG] BrowserFetcher is still not working on Android

kazemcodes opened this issue

Here is the error I get when using BrowserFetcher.
I think the error is because of htmlunit-android:

2022-04-12 21:07:05.566 5395-5451/ir.kazemcodes.infinityreader E/AndroidRuntime: FATAL EXCEPTION: DefaultDispatcher-worker-2
    Process: ir.kazemcodes.infinityreader, PID: 5395
    java.lang.NoClassDefFoundError: Failed resolution of: Ljava/awt/datatransfer/ClipboardOwner;
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.handleCharacters(HtmlUnitNekoDOMBuilder.java:593)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:303)
        at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source:146)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:289)
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source:0)
        at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.startElement(HTMLTagBalancer.java:812)
        at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.startElement(DefaultFilter.java:140)
        at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.startElement(NamespaceBinder.java:278)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2811)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2131)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937)
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443)
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source:5)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:204)
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:298)
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:218)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:686)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:588)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:506)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:413)
        at it.skrape.fetcher.BrowserFetcher.fetch(BrowserFetcher.kt:19)
        at org.ireader.presentation.feature_library.presentation.LibraryScreenKt$LibraryScreen$3$2$1$1$1$2$1.invokeSuspend(LibraryScreen.kt:157)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
     Caused by: java.lang.ClassNotFoundException: Didn't find class "java.awt.datatransfer.ClipboardOwner" on path: DexPathList[[dex file "/data/data/ir.kazemcodes.infinityreader/code_cache/.overlay/base.apk/classes4.dex", dex file "/data/data/ir.kazemcodes.infinityreader/code_cache/.overlay/base.apk/classes11.dex", zip file "/data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/base.apk"],nativeLibraryDirectories=[/data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/lib/arm64, /data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/base.apk!/lib/arm64-v8a, /system/lib64, /system/system_ext/lib64]]
        at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:207)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:379)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:312)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.handleCharacters(HtmlUnitNekoDOMBuilder.java:593) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:303) 
        at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source:146) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:289) 
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source:0) 
        at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.startElement(HTMLTagBalancer.java:812) 
        at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.startElement(DefaultFilter.java:140) 
        at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.startElement(NamespaceBinder.java:278) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2811) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2131) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937) 
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443) 
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394) 
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source:5) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:204) 
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:298) 
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:218) 
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:686) 
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:588) 
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:506) 
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:413) 
        at it.skrape.fetcher.BrowserFetcher.fetch(BrowserFetcher.kt:19) 
        at org.ireader.presentation.feature_library.presentation.LibraryScreenKt$LibraryScreen$3$2$1$1$1$2$1.invokeSuspend(LibraryScreen.kt:157) 
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) 
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665) 
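
The root cause in the trace above is that `java.awt.datatransfer.ClipboardOwner` cannot be found on the Android classpath. As a minimal sketch, an app could probe for such a class before calling into BrowserFetcher and fall back to another fetcher instead of crashing. The helper name `isClassAvailable` is hypothetical and is not part of skrape.it or HtmlUnit.

```kotlin
// Hypothetical helper (an assumption for illustration, not a skrape.it or HtmlUnit API).
// Returns true if the named class can be loaded by the current class loader.
fun isClassAvailable(className: String): Boolean =
    try {
        Class.forName(className)
        true
    } catch (e: ClassNotFoundException) {
        // The class is not on the classpath, as in the trace above.
        false
    } catch (e: LinkageError) {
        // Covers NoClassDefFoundError and related failures while linking.
        false
    }

fun main() {
    val awtClipboard = "java.awt.datatransfer.ClipboardOwner"
    if (isClassAvailable(awtClipboard)) {
        println("$awtClipboard is available")
    } else {
        println("$awtClipboard is missing; BrowserFetcher would likely fail here")
    }
}
```

This only detects the missing class up front; it does not make the affected HtmlUnit code path work on Android.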

I think the issue is because of this

HtmlUnit/htmlunit#448

Can skrapeit bypass Cloudflare protection?

The issue is already fixed in the htmlunit 2.59.0-SNAPSHOT build, but there is a new problem in that build. Not all sites throw this exception, but some complex URLs do, for example https://pstbn.top/?c865fde3461094d1#2hAyyUKtXm72BHLzyzhq7UBug9YMP1FCgnJccA8YyQ2n

OK, I see. Thanks for finding this.
I will check whether we have everything that @rbri suggests. If not, I will add it and make a new release.

I got here

ScriptException: missing ; before statement (https://pstbn.top/js/zlib-1.2.11.js#6)

Do you see the same?

I only got this exception:
java.lang.NoClassDefFoundError: Failed resolution of: Ljava/awt/datatransfer/ClipboardOwner

I have a question regarding htmlunit: is there any way to make the headless browser hold its request until a certain HTML tag appears or some criterion is fulfilled before fetching the HTML?
something like this func
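
One way to approximate this kind of wait is a small polling helper (this is an illustration, not the function referred to above). The sketch below is hypothetical: `waitUntil` is an invented name, and it assumes direct access to HtmlUnit's `WebClient` and `HtmlPage`, which skrape.it's BrowserFetcher may not expose. The HtmlUnit calls appear only in comments so that the block runs on its own.

```kotlin
// Hypothetical helper: polls `condition` until it returns true or the timeout elapses.
// Returns true if the condition was met in time, false otherwise.
fun waitUntil(timeoutMs: Long, pollMs: Long = 100, condition: () -> Boolean): Boolean {
    val deadline = System.nanoTime() + timeoutMs * 1_000_000
    while (true) {
        if (condition()) return true
        if (System.nanoTime() >= deadline) return false
        Thread.sleep(pollMs)
    }
}

// Possible use with HtmlUnit (assumption: `webClient` and `page` were obtained directly
// from HtmlUnit; "div.content" is a placeholder selector):
//
//   val page: HtmlPage = webClient.getPage("https://example.com")
//   val ready = waitUntil(timeoutMs = 10_000) {
//       webClient.waitForBackgroundJavaScript(200)         // let pending scripts run
//       page.querySelector<DomNode>("div.content") != null // criterion: the tag exists
//   }

fun main() {
    var attempts = 0
    val ok = waitUntil(timeoutMs = 1_000, pollMs = 10) { ++attempts >= 3 }
    println("condition met: $ok after $attempts attempts")
}
```

Polling with a timeout keeps the caller in control of how long to wait, at the cost of checking the condition repeatedly.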

Actually, skrapeit is using htmlunit 2.59.0, which throws this exception. Higher versions require a higher minimum API level (Android O); I haven't tested such a version.

I guess this API requirement has something to do with changes in Rhino.

Any news about this issue? I got this error while using BrowserFetcher:

java.lang.NoSuchFieldError: No static field INSTANCE of type Lorg/apache/http/conn/ssl/AllowAllHostnameVerifier; in class Lorg/apache/http/conn/ssl/AllowAllHostnameVerifier; or its superclasses (declaration of 'org.apache.http.conn.ssl.AllowAllHostnameVerifier' appears in /system/framework/framework.jar!classes3.dex)

Are you using the latest version of skrapeit (1.2.1)?

It recently fixed the issue for other people, e.g. here: #185 (comment)

I just tried with 1.2.1 and got this error

Execution failed for task ':app:mergeDebugJavaResource'.
> A failure occurred while executing com.android.build.gradle.internal.tasks.MergeJavaResWorkAction
   > 2 files found with path 'mozilla/public-suffix-list.txt' from inputs:
      - /Users/yusuf/.gradle/caches/transforms-3/f245c43f9945c78889e5173b03033420/transformed/jetified-htmlunit-android-2.58.0.jar
      - /Users/yusuf/.gradle/caches/transforms-3/96071c01f90e37e991c04a7f8de1ffc4/transformed/jetified-httpclient-4.5.6.jar
     Adding a packagingOptions block may help, please refer to
     https://google.github.io/android-gradle-dsl/current/com.android.build.gradle.internal.dsl.PackagingOptions.html
     for more information

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

Adding packagingOptions and running Invalidate Caches / Restart did not help.

There is also a Stack Overflow question about this, but it has no answers yet.

Adding public-suffix-list.txt to packagingOptions solved the compilation error, but this time I got the same error as @kazemcodes:

E/AndroidRuntime:     at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:570)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:749)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:677)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:664)
    	Suppressed: kotlinx.coroutines.DiagnosticCoroutineContextException: [StandaloneCoroutine{Cancelling}@7073866, Dispatchers.Main.immediate]
    Caused by: java.lang.ClassNotFoundException: Didn't find class "java.awt.datatransfer.ClipboardOwner" on path: DexPathList[[zip file "/data/app/com.project.skrapeplayground-1r7tHui0F1lYklkZgTKUlQ==/base.apk"],nativeLibraryDirectories=[/data/app/com.project.skrapeplayground-1r7tHui0F1lYklkZgTKUlQ==/lib/arm64, /system/lib64, /hw_product/lib64, /system/product/lib64]]
        at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:209)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:379)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:312)
        	... 49 more
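
For reference, the packagingOptions workaround for the duplicate `mozilla/public-suffix-list.txt` resource could look roughly like the following in a Gradle Kotlin DSL build file. This is a sketch under assumptions: it targets the Android Gradle Plugin 7.x DSL, and it uses `pickFirsts`, whereas the exact option used in this thread is not stated. It only addresses the build failure, not the runtime `ClassNotFoundException` shown above.

```kotlin
// app/build.gradle.kts (sketch; assumes Android Gradle Plugin 7.x)
android {
    packagingOptions {
        resources {
            // Both htmlunit-android and httpclient ship this file;
            // keep the first copy found instead of failing the merge.
            pickFirsts += "mozilla/public-suffix-list.txt"
        }
    }
}
```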

Recently, htmlunit-android released a new snapshot that fixes this problem, but right now it requires at least Android O as the minimum API level.

Please update htmlunit to the latest snapshot; this problem is fixed there.

net.sourceforge.htmlunit:htmlunit-android:2.63.0-SNAPSHOT

htmlunit-android:2.63.0 was released some days ago (https://twitter.com/htmlunit)
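
Until a skrape.it release picks up the new version, a consuming app could try declaring the newer htmlunit-android directly, since Gradle resolves version conflicts to the highest requested version by default. This is a sketch with several assumptions: the `it.skrape:skrapeit` coordinates, the `minSdk` property of Android Gradle Plugin 7.x, and above all that skrapeit 1.2.1 works with htmlunit-android 2.63.0, which is not confirmed in this thread.

```kotlin
// app/build.gradle.kts (sketch; compatibility of these versions is an assumption)
android {
    defaultConfig {
        // Android O corresponds to API level 26; the snapshot discussed above
        // was reported to need at least this level.
        minSdk = 26
    }
}

dependencies {
    implementation("it.skrape:skrapeit:1.2.1")
    // Requesting the newer version directly overrides the older transitive one.
    implementation("net.sourceforge.htmlunit:htmlunit-android:2.63.0")
}
```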

Big thanks for the great work, @rbri. I will bump the version in BrowserFetcher and release a new version of skrape.it.

it's a pleasure

skrapeit patch version 1.2.2 has just been published to maven central