debakarr / kodekloud-downloader

Simple downloader for https://kodekloud.com/

Wrong scraped video URLs, extension for #23

Ziad-Tawfik opened this issue · comments

Hello Debakarr,

Thanks for your reply and the fast update to the code :)
I replied to the earlier issue, but it had already been closed, so please find my findings below.

I commented out this part of the logic so that the bot keeps working without failing:
raise SystemExit(
    "Your cookie might have expired or you don't have access to the course."
    "\nPlease refresh/regenerate the cookie or enroll in the course and try again."
)

However, the main problem is in the scraping itself: the two variables below (main_lesson_content and topics) do not contain the correct values for the videos when soup.find and the zip function are used.

main_lesson_content = soup.find("div", class_="lessons_main__content") or soup.find("div", class_="ld-lesson-list")
topics = main_lesson_content.find_all("div", class_="w-dyn-item") or main_lesson_content.find_all("div", class_="ld-item-list-items")

I investigated by printing both of them and found that the scraped "Billing and Pricing" topic is zipped with the URLs of the previous topic, "Technology - Part Two", which is what raises the error. The change above only makes the bot create a folder for the "Billing and Pricing" topic and then download all the "Technology - Part Two" videos again.

That is why the videos appear to be duplicated. Is there any way to fix this?
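One way to confirm this kind of duplication before downloading anything is to compare the URL lists scraped for each topic. The following is only a minimal sketch: `find_duplicate_topics`, the `scraped` dictionary, and the `example.invalid` URLs are hypothetical and are not part of kodekloud-downloader.

```python
from collections import defaultdict


def find_duplicate_topics(topic_urls):
    """Return groups of topic names whose scraped URL lists are identical."""
    seen = defaultdict(list)
    for name, urls in topic_urls.items():
        if urls:  # ignore topics for which nothing was scraped
            seen[tuple(urls)].append(name)
    return [names for names in seen.values() if len(names) > 1]


# Hypothetical scraped data that mimics the behaviour described above.
scraped = {
    "Technology - Part One": ["https://example.invalid/lesson/a"],
    "Technology - Part Two": [
        "https://example.invalid/lesson/b",
        "https://example.invalid/lesson/c",
    ],
    "Billing and Pricing": [
        "https://example.invalid/lesson/b",
        "https://example.invalid/lesson/c",
    ],
}

duplicates = find_duplicate_topics(scraped)
print(duplicates)
```

A non-empty result would suggest that two topics were paired with the same lesson links, so the downloader could warn or skip instead of downloading the same videos twice.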

Thanks in advance, you've done great work though!

Oh, thanks for the explanation. I am still at the office; I will check once I get back.

Commenting out this part will not let you download the "Billing and Pricing" videos, because KodeKloud has a bug in this course's table of contents: instead of the "Billing and Pricing" videos, there are links to the "Technology - Part Two" ones. You can check it yourself with curl.

The same problem was described here: #9

@Tisona
Yes, I mentioned that it downloaded the same content of "Technology - Part Two" twice, but the bot continued to work and didn't stop.
When I inspect the page in the browser, I can't see any problem with the "Billing and Pricing" part; it looks the same as the previous parts.

The same problem also occurs with "Docker-vs-ContainerD (13:05)" in Core Concepts in the course below:
https://kodekloud.com/courses/certified-kubernetes-administrator-cka/

The downloader does not use a browser; use curl to check what the downloader actually receives.
Also check the issue I mentioned above; it will make things clearer.
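The same check can be scripted instead of run through curl. This is a rough sketch using only the Python standard library: `LinkCollector`, `extract_links`, and `fetch_html` are hypothetical helper names, and the sample HTML is made up for illustration; it is not the markup KodeKloud actually serves.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect every href found on <a> tags in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the links present in the HTML exactly as the server sent it."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_html(url):
    """Fetch a page without cookies or JavaScript, as a plain HTTP client would."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample markup, used here so the example runs without network access.
sample = '<div><a href="/lesson/one">One</a><a>No link</a><a href="/lesson/two">Two</a></div>'
sample_links = extract_links(sample)
print(sample_links)

# Against the real site (requires network access):
# links = extract_links(fetch_html("https://kodekloud.com/courses/aws-cloud-practitioner/"))
```

Comparing this list with what the browser shows should make it clear whether the wrong links come from the server response or from the parsing step.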

The issue appears to be that a request for the course page made without logging in does not return the same content as a request made while logged in. Opening https://kodekloud.com/courses/aws-cloud-practitioner/ in Incognito:

Very strange.

image

What I quickly tried was copying the curl command from the Network tab (while I was logged in):
image

I then used curlconverter to convert that into Python code: https://curlconverter.com/python/

and checked whether the video appears in the response body:
image

So it looks like making the requests with the cookie might help. However, the response body is a bit different from the one we get without any auth or cookie, so the class names used to parse the topics and lessons need to change.
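A minimal sketch of such an authenticated request, assuming the cookie string is copied from a logged-in browser session. It uses the standard library's `urllib` rather than the curlconverter output; `build_course_request` is a hypothetical helper, and `session=PLACEHOLDER` is a placeholder, not a real KodeKloud cookie name.

```python
from urllib.request import Request


def build_course_request(course_url, cookie_header):
    """Build a GET request that carries the logged-in session cookie."""
    return Request(
        course_url,
        headers={
            "Cookie": cookie_header,      # value copied from the browser
            "User-Agent": "Mozilla/5.0",  # some sites reject the default agent
        },
    )


course_url = "https://kodekloud.com/courses/aws-cloud-practitioner/"
request = build_course_request(course_url, "session=PLACEHOLDER")
print(request.get_header("Cookie"))

# Sending it (requires network access and a valid cookie):
# from urllib.request import urlopen
# html = urlopen(request, timeout=30).read().decode("utf-8", errors="replace")
```

Because the logged-in response uses different markup, the selectors that parse topics and lessons would still need to be adapted to it.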

The issue got auto-closed, but you can reopen it if it is not fixed.