Sammeeey / ccallmaps

framework (programs & workflows) to collect data from Google Maps (as HTML) & extract it into a format (CSV) in which it can be used for efficient & effective cold calls (without duplicates)


callmaps aka. pwmaps

built & tested using Python 3.9.7


Elements of Framework

readRegions.py

reads plz-column from zuordnung_plz_ort.csv

  • helper script used once to initially get the PLZ's (German postal codes) - the relevant list of PLZ's now lives in plzList.py - may be reused/extended later
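
The core of such a helper can be sketched in a few lines (a hypothetical sketch: it assumes zuordnung_plz_ort.csv has a plz column; the actual readRegions.py may differ):

```python
import csv
import io

def read_plz_column(csv_file, column="plz"):
    """Read the PLZ column from an open CSV file & return the unique PLZ's in order."""
    seen = []
    for row in csv.DictReader(csv_file):
        plz = row[column].strip()
        if plz not in seen:
            seen.append(plz)
    return seen

# demo on inline sample data instead of the real zuordnung_plz_ort.csv
sample = io.StringIO("ort,plz\nDresden,01067\nDresden,01069\nDresden,01067\n")
print(read_plz_column(sample))  # ['01067', '01069']
```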

plzList.py

file containing list of plz's generated from readRegions.py

  • set up once & left unedited afterwards (actually rather a database than a Python file)
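
plzList.py is essentially data in Python clothing; it presumably looks something like this (the PLZ values shown are illustrative):

```python
# plzList.py - generated once by readRegions.py, then left unedited
# (effectively a database rather than a program)
plzList = [
    "01067",
    "01069",
    "01097",
    # ... many more German postal codes
]
```

Other scripts can then simply do `from plzList import plzList`.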

data_ColdCalls.ods

  • spreadsheet to list potential contacts & document cold calls

ccallmaps.py

  • starts chrome browser & searches through list of given search terms on Google Maps (after denying cookies)
  • works only for the German Google Maps UI so far (selectors target German text, because the CSS classes are generated dynamically)
  • super unstable

3 different versions for 3 different systems:

  • ccallmaps.py (Windows)
  • ccallmaps-slow.py (Raspberry Pi)
  • ccallmaps-slow-systemd.py (systemd service on Raspberry Pi)

systemd service (on Linux): ccallmaps.service

systemd service, which starts ccallmaps.py on Linux & restarts it whenever it fails

  • saved in /etc/systemd/user/ccallmaps.service (see usage > systemd > setup & start below)
  • meant to run infinitely (no auto-restart after reboot; only restart after crash of ccallmaps.py)

findFilterInfo.py

finds the target elements in the HTML files from ccallmaps.py searches & writes them to CSV files with pre-sorted data, preparing them to be concatenated & filtered (by pdMergeOnlyUnique.py) so they can amend the existing contacts in the searches sheet of data_ColdCalls.ods

  • finds Google Business Profile, website (or no website), job
  • finds the job category (within the medical sector: physician, alternative practitioner, healer, physio, ergo, dentist, yoga, animal, psychologist) based on the text in the article element & writes it into the same row as the respective Google Business Profile link in the CSV
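
The category guess can be approximated with simple keyword matching on the article text (a stdlib sketch with illustrative keyword sets; the real findFilterInfo.py parses the HTML with bs4 & uses its own keywords):

```python
# map medical-sector categories to (illustrative) German keywords
CATEGORIES = {
    "physio": ["physiotherapie", "krankengymnastik"],
    "dentist": ["zahnarzt", "zahnärztin"],
    "yoga": ["yoga"],
    "psychologist": ["psychologe", "psychotherapie"],
}

def guess_job_category(article_text):
    """Return the first category whose keyword appears in the text, else None."""
    text = article_text.lower()
    for category, keywords in CATEGORIES.items():
        if any(kw in text for kw in keywords):
            return category
    return None

print(guess_job_category("Praxis für Physiotherapie und Krankengymnastik"))  # physio
```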

How to findFilterInfo.py

format of result sheet: firstSheetOfListStem-lastSheetOfListStem.csv

  • move one or more search-result-HTMLs to same directory as findFilterInfo.py
    • delete leftover results from earlier single searches first, if necessary!
    • open directory in windows file explorer
      • sort files by name (ascending)
      • highlight one or more HTML files
        • click on the HTML file with the biggest plz first (bottom), then scroll to the top, press+hold Shift & click the HTML file with the smallest plz (to highlight all files)

          Do it that way so that the smallest plz is at the beginning & the biggest plz at the end of the list in findFilterInfo.py later, so that the output CSV is named smallestPlz_targetGroup-biggestPlz_targetGroup.csv

        • press CTRL+C to copy filenames
  • open findFilterInfo.py in VSCode
    • create fileInput list at beginning of findFilterInfo.py & paste copied HTML-filenames between brackets
    • delete the .html extension from all filenames (can be done by highlighting all occurrences using CTRL+D (several times))
    • surround all file stems with quotes & append a comma to create a valid Python list
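
After these edits, the fileInput list & the derived output name look roughly like this (the stems are illustrative; the naming follows the firstSheetOfListStem-lastSheetOfListStem.csv format described above):

```python
# finished fileInput list at the beginning of findFilterInfo.py (stems, no .html)
fileInput = [
    "01067_physio",
    "01069_physio",
    "04103_physio",
]

# combined output CSV is named after the first & last stem in the list
outName = f"{fileInput[0]}-{fileInput[-1]}.csv"
print(outName)  # 01067_physio-04103_physio.csv
```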

limitations/specifics of findFilterInfo.py

  • needs bs4 (BeautifulSoup 4) installed (pip install beautifulsoup4)
  • currently only works if findFilterInfo.py in same directory as HTML files to be extracted
    • extracts to same-named CSV-files in same directory (format of resulting csv files: originalFileStem.csv)
  • doesn't accept input; the filenames (stems!) or a list of them must be written to the fileInput variable at the beginning of findFilterInfo.py

pdMergeOnlyUnique.py

script - merges CSV sheets into a new sheet while discarding duplicate entries AND then keeps only the new rows which haven't been in the note-taking df before
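
The merge-and-keep-only-new idea can be sketched with the stdlib (the actual script apparently uses pandas, judging by its name; the key column used here is illustrative):

```python
import csv
import io

def unique_new_rows(existing_file, new_file, key="gbp_link"):
    """Return rows from new_file whose key value isn't already in existing_file."""
    known = {row[key] for row in csv.DictReader(existing_file)}
    result = []
    for row in csv.DictReader(new_file):
        if row[key] not in known:
            known.add(row[key])  # also drops duplicates within new_file itself
            result.append(row)
    return result

existing = io.StringIO("gbp_link,job\nhttps://a,physio\n")
new = io.StringIO("gbp_link,job\nhttps://a,physio\nhttps://b,dentist\nhttps://b,dentist\n")
print(unique_new_rows(existing, new))  # [{'gbp_link': 'https://b', 'job': 'dentist'}]
```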

How to pdMergeOnlyUnique.py

format of result sheet: YY-MM-DD_uniqueNew_activeSheetStem+newDataSheet

  • prepare contact-info-CSVs in same directory as pdMergeOnlyUnique.py
    • save results sheet from data_ColdCalls.ods as CSV (e.g. with format YY-MM-DD_data_ColdCalls.csv)
    • copy firstSheetOfListStem-lastSheetOfListStem.csv (containing filtered search results; created by findFilterInfo.py above) to the directory of pdMergeOnlyUnique.py
  • create file which only contains unique new contacts (which haven't been in results sheet of data_ColdCalls.ods before)
    • navigate to directory of saved files & pdMergeOnlyUnique.py
    • run py pdMergeOnlyUnique.py data_ColdCalls.ods firstSheetOfListStem-lastSheetOfListStem.csv on command line (on windows)
  • open resulting csv file (saved in same directory)
    • highlight rows & columns containing data (CTRL+Shift+arrow-keys) & copy data
    • paste the copied data from the new file at the end of the result sheet in the active data_ColdCalls.ods

tools & sources

installation

usage

systemd

setup & start

  1. (sudo apt-get install -y systemd) (may be pre-installed on linux already)
  2. sudo nano /etc/systemd/user/ccallmaps.service - to create user service (not system service as in runVenv) (copy+paste content from ccallmaps.service)
  3. systemctl --user daemon-reload
  4. (systemctl --user enable ccallmaps.service - not relevant here because service not configured to automatically start on reboot)
  5. systemctl --user start ccallmaps.service
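
The unit file copied in step 2 presumably looks something like this (a hedged sketch: the Restart policy & the missing [Install]-triggered autostart follow the description above, the script path & description are assumptions):

```ini
# /etc/systemd/user/ccallmaps.service
[Unit]
Description=ccallmaps Google Maps scraper

[Service]
# path to the script is illustrative
ExecStart=/usr/bin/python3 /home/pi/ccallmaps/ccallmaps-slow-systemd.py
Restart=on-failure

# no [Install] section: service isn't enabled to start automatically on reboot
```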

debug & stop

  1. systemctl --user status ccallmaps.service
  2. systemctl --user stop ccallmaps.service

search-result-HTML's to search-result-CSV

using findFilterInfo.py (see above)

search-result-CSV's to non-duplicate-CSV (YY-MM-DD_targetGroup_oldestPlz-latestPlz.csv; for amendment of results-sheet in data_ColdCalls.ods)

using pdMergeOnlyUnique.py (see above)

debug info

  • .log-file created by ccallmaps.py in same directory (& with same name)
    • doesn't contain print()-statements from fileOperations.py
    • doesn't contain error messages from the errors that actually make the script fail (only the expected, caught & described ones in the script)
      • journalctl --user-unit ccallmaps contains info about the actual errors but may need to be cleaned from time to time (see usage > systemd > debug & stop above)

what actually happens

  • Google Maps Search in Playwright (non-headless Chrome Browser on Linux Raspberry Pi)
  • Python programs extract, filter & transform the relevant data (Google Business Profile link, website, job-type guess) from different searches into a duplicate-free CSV format
    • further Python programs then compare existing contact data with new data & filter out duplicates
  • result: ongoing collection of non-duplicate contacts for e.g. cold calling

resources

libraries/frameworks

code

ccallmaps.py

fileOperations.py

pdMergeOnlyUnique.py

findFilterInfo.py

approaches

limitations

  • slow
  • region of search only determined by zip code (PLZ)

known issues

logging/debugging

  • find1stfilePart() in fileOperations.py just prints (& doesn't log), so its information won't appear in the logs of ccallmaps.py
    • potential solution: put all functions into ccallmaps.py (and make them all part of one class)
  • don't know how to properly delete the systemd log for the user (Raspi expected to run out of storage sooner or later, because logs are probably only archived, not deleted, by --rotate)

potential improvements

results/attempts
