Facebook crawling using IP hiding techniques

Crawls the ID, user info, content, date, comments, and replies of posts on a Facebook page.

Demo: https://www.youtube.com/watch?v=Fx0UWOzYsig

Overview

I. Features

Retrieves information about posts.
Filters comments.
Does not require signing in.
Checks for redirects.
Runs in an Incognito window.
Simplifies the browser to reduce loading time.
Hides the IP address to avoid being banned, by either:
- Collecting free proxies with HTTP Request Randomizer and filtering out the slowest ones.
- Routing through Tor relays, as used by Tor Browser. The Tor network is made up of thousands of volunteer-run servers.
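
The "filter out the slowest proxies" step can be sketched roughly as follows. This is a minimal illustration, not the code used in this repository: `measure_latency`, `keep_fastest`, the `check` callback, and the thresholds are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, List


def measure_latency(check: Callable[[str], None], proxy: str) -> float:
    """Return the seconds check(proxy) takes, or infinity if it raises.

    `check` is a hypothetical callback that performs one request
    through the given proxy (for example, loading a small test page).
    """
    start = time.monotonic()
    try:
        check(proxy)
    except Exception:
        return float("inf")  # treat a failing proxy as unusably slow
    return time.monotonic() - start


def keep_fastest(latencies: Dict[str, float], keep: int, max_latency: float) -> List[str]:
    """Drop proxies slower than max_latency and return the `keep` fastest."""
    usable = [(latency, proxy) for proxy, latency in latencies.items()
              if latency <= max_latency]
    return [proxy for _, proxy in sorted(usable)[:keep]]
```

Measuring first and filtering afterwards keeps the network calls separate from the selection logic, so the selection can be checked without any real proxies.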

II. Weaknesses

Cannot handle some failed responses automatically. Example: a RATE LIMIT EXCEEDED response (Facebook refuses to load more content) can only be spotted by running without HEADLESS.
Quite slow when many "load more" actions are required.

III. Result

Each post is written on its own line.
Most of my successful tests were on Firefox with the HTTP Request Randomizer proxy server.
Latest run: Firefox in Incognito windows, using HTTP Request Randomizer.

Example data fields for a post

{
    "url": "https://www.facebook.com/KTXDHQGConfessions/videos/352525915858361/",
    "id": "352525915858361",
    "utime": "1603770573",
    "text": "Diễn tập PCCC tại KTX khu B tòa E1. ----------- #ktx_cfs Nguồn : Trường Vũ",
    "reactions": ["308 Like", "119 Haha", "28 Wow"],
    "total_shares": "26 Shares",
    "total_cmts": "169 Comments",
    "crawled_cmts": [
        {
            "id": "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0MzIyMTY2MzcwMjc%3D",
            "utime": "1603770714",
            "user_url": "https://www.facebook.com/KTXDHQGConfessions/",
            "user_id": "KTXDHQGConfessions",
            "user_name": "KTX ĐHQG Confessions",
            "text": "Toà t á bây :) #Lép",
            "replies": [
                {
                    "id": "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0OTc5MDk5NjM3OTE%3D",
                    "utime": "1603772990",
                    "user_url": "https://www.facebook.com/KTXDHQGConfessions/",
                    "user_id": "KTXDHQGConfessions",
                    "user_name": "KTX ĐHQG Confessions",
                    "text": "Nguyễn Hoàng Đạt thật đáng tự hào :) #Lép"
                }
            ]
        }
    ]
}
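
A record like the one above can be post-processed with the standard library. The sketch below assumes the output holds one JSON object per line and that count labels use plain integers as in the example ("308 Like", "26 Shares"); `split_label` and `summarize_post` are hypothetical helper names, not part of this repository.

```python
import json
from datetime import datetime, timezone
from typing import Tuple


def split_label(label: str) -> Tuple[int, str]:
    """Split a label such as '308 Like' into (308, 'Like').

    Assumes a plain integer count; abbreviated counts such as '1.2K'
    are not handled and raise ValueError.
    """
    number, _, name = label.partition(" ")
    return int(number.replace(",", "")), name


def summarize_post(line: str) -> dict:
    """Parse one output line and return a few derived fields."""
    post = json.loads(line)
    comments = post.get("crawled_cmts", [])
    reactions = {}
    for label in post.get("reactions", []):
        count, name = split_label(label)
        reactions[name] = count
    return {
        "id": post["id"],
        # utime is a Unix timestamp stored as a string
        "posted_at": datetime.fromtimestamp(int(post["utime"]), tz=timezone.utc),
        "reactions": reactions,
        "shares": split_label(post.get("total_shares", "0 Shares"))[0],
        "crawled_comments": len(comments),
        "crawled_replies": sum(len(c.get("replies", [])) for c in comments),
    }
```

Comparing `crawled_comments` with the number in `total_cmts` gives a quick indication of how much of a post's discussion was actually collected.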

Usage

I. Install libraries

pip install -r requirements.txt

Helium: a wrapper around Selenium with a higher-level API for web automation.
HTTP Request Randomizer: used for collecting free proxies.

II. Customize parameters in crawler.py

Running browser:

PAGE_URL: URL of the Facebook page.
TOR_PATH: use Tor as the proxy on WINDOWS / MAC / LINUX, or NONE to run without Tor.
BROWSER_OPTIONS: run the scripts with CHROME / FIREFOX.
PRIVATE: run in private mode:
- Avoid Selenium detection ➩ navigator.webdriver must be undefined (check in Dev Tools).
- Start the browser in an Incognito / Private window.
USE_PROXY: whether to run through a proxy. If True ➩ check:
- IF TOR_PATH ≠ NONE ➩ use Tor's SOCKS proxy server.
- ELSE ➩ randomize proxies with HTTP Request Randomizer.
HEADLESS: whether to run the browser headless.
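
The way USE_PROXY and TOR_PATH combine can be written as a small decision function. This is only a sketch of the rule described above: `ProxyMode` and `choose_proxy_mode` are hypothetical names, and TOR_PATH = NONE is modelled here as Python's `None`, which may differ from how crawler.py represents it.

```python
from enum import Enum
from typing import Optional


class ProxyMode(Enum):
    DIRECT = "direct"          # no proxy at all
    TOR = "tor"                # Tor's SOCKS proxy server
    FREE_PROXY = "free_proxy"  # random proxy from HTTP Request Randomizer


def choose_proxy_mode(use_proxy: bool, tor_path: Optional[str]) -> ProxyMode:
    """Mirror the USE_PROXY / TOR_PATH rule from the parameter list."""
    if not use_proxy:
        return ProxyMode.DIRECT
    if tor_path is not None:
        return ProxyMode.TOR
    return ProxyMode.FREE_PROXY
```

Note that under this rule TOR_PATH has no effect unless USE_PROXY is True.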

SPEED_UP: simplify the browser to minimize loading time:

With Chrome:

# Disable loading image, CSS, ...
browser_options.add_experimental_option('prefs', {
    "profile.managed_default_content_settings.images": 2,
    "profile.managed_default_content_settings.stylesheets": 2,
    "profile.managed_default_content_settings.cookies": 2,
    "profile.managed_default_content_settings.geolocation": 2,
    "profile.managed_default_content_settings.media_stream": 2,
    "profile.managed_default_content_settings.plugins": 1,
    "profile.default_content_setting_values.notifications": 2,
})

With Firefox:

# Disable loading image, CSS, Flash
browser_options.set_preference('permissions.default.image', 2)
browser_options.set_preference('permissions.default.stylesheet', 2)
# This is a boolean preference, so pass a bool rather than the string 'false'
browser_options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)

Loading page:
- SCROLL_DOWN: number of times to scroll down to load more posts.
- FILTER_CMTS_BY: filter comments by MOST_RELEVANT / NEWEST / ALL_COMMENTS.
- VIEW_MORE_CMTS: number of times to click view more comments.
- VIEW_MORE_REPLIES: number of times to click view more replies.
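
A bounded "view more" loop of the kind these counters control might look like the sketch below. It is an illustration under assumptions, not the repository's implementation: `load_more`, `click_more`, and `count_items` are hypothetical names, and the callbacks stand in for whatever browser actions click the button and count the loaded comments or replies.

```python
from typing import Callable


def load_more(click_more: Callable[[], bool],
              count_items: Callable[[], int],
              max_clicks: int) -> int:
    """Click "view more" up to max_clicks times; return the useful clicks.

    click_more returns False when there is no button left to click.
    The loop also stops when a click loads nothing new, which is how a
    failed or rate-limited response would show up.
    """
    useful_clicks = 0
    seen = count_items()
    for _ in range(max_clicks):
        if not click_more():
            break
        now = count_items()
        if now <= seen:
            break  # nothing new loaded: stop instead of clicking again
        seen = now
        useful_clicks += 1
    return useful_clicks
```

In a real browser session, `count_items` would need to wait for the page to finish loading before counting. Otherwise a slow response looks the same as a failed one.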

III. Start running

python crawler.py

Run while signed out, because some CSS selectors differ when signed in.
With some proxies, it may be quite slow or Facebook may require signing in.
To achieve higher speed:
- If this is your first time using these scripts, you can run without Tor and proxies until Facebook requires you to sign in.
- Or use a popular VPN service (also running without Tor and proxies): Touch VPN (free), Hotspot Shield VPN (free, Premium available), ...
- Learn more about 4 ways to hide your IP address and compare their speed.
To collect a large number of comments:
- Load more posts to collect more comments in case viewing more comments / replies fails.
- Run the browser without headless mode to detect failed responses (comments / replies no longer loading).

Test proxy server

With HTTP Request Randomizer:

from browser import *
page_url = 'http://check.torproject.org'
proxy_server = random.choice(proxies).get_address()
browser_options = BROWSER_OPTIONS.FIREFOX

setup_free_proxy(page_url, proxy_server, browser_options)
# kill_browser()

With Tor Relays:

from browser import *
page_url = 'http://check.torproject.org'
tor_path = TOR_PATH.WINDOWS
browser_options = BROWSER_OPTIONS.FIREFOX

setup_tor_proxy(page_url, tor_path, browser_options)
# kill_browser()
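
For reference, routing Firefox through Tor's SOCKS proxy usually comes down to a handful of Firefox preferences. The sketch below only builds that preference mapping. `tor_socks_prefs` is a hypothetical helper, and whether `setup_tor_proxy` sets exactly these values is an assumption. Port 9150 is the default SOCKS port of Tor Browser, while a standalone tor daemon listens on 9050 by default.

```python
from typing import Dict, Union

PrefValue = Union[int, str, bool]


def tor_socks_prefs(host: str = "127.0.0.1", port: int = 9150) -> Dict[str, PrefValue]:
    """Return Firefox preferences that send traffic through a SOCKS5 proxy."""
    return {
        "network.proxy.type": 1,                 # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": port,
        "network.proxy.socks_version": 5,
        "network.proxy.socks_remote_dns": True,  # resolve DNS through the proxy too
    }
```

Each entry could then be applied with `browser_options.set_preference(name, value)`, the same call used in the SPEED_UP snippet for Firefox. Opening http://check.torproject.org afterwards shows whether the traffic really goes through Tor.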
