somnathrakshit / geograpy3

Extract place names from a URL or text, and add context to those names -- for example distinguishing between a country, region or city.

Home Page:https://geograpy3.readthedocs.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[BUG] "United Kingdom" Matches to United States

Helw150 opened this issue · comments

Describe the bug
A clear and concise description of what the bug is.

The string "United Kingdom" matches to the United States. It should match to the United Kingdom.

To Reproduce

!pip install geograpy3
import geograpy                                              
places = geograpy.get_geoPlace_context(text="United Kingdom")
print(places)

Output:

countries=['United States of America']
regions=[]
cities=['United']
other=[]

Expected behavior
Expected Output:

countries=['United Kingdom']
regions=[]
cities=[]
other=[]

Environment:

  • OS: Ubuntu 20.04
  • Python Version: 3.10.1

Confirmed that it is not an environment issue with a minimal reproduction on Colab: https://colab.research.google.com/drive/1iZ8Qk8p044jF1hWNDxQjFPZ_MruOefNW?usp=sharing

Same thing also happens with

places = geograpy.get_geoPlace_context(text="England")
print(places)

Output:

countries=['United States of America']
regions=[]
cities=['England']
other=[]

One more case of odd mapping to the United States:

places = geograpy.get_geoPlace_context(text="Republic of Korea")
print(places)

Output:

countries=['United States of America']
regions=[]
cities=['Republic', 'Korea']
other=[]

You are using an outdated API.

I ended up switching to a different library for my project but, for others use, it might help to understand which aspect is outdated in the Colab.

The package is installed using the setup instructions from the readme.

Screenshot_20231027-143104.png

Is this the part that's outdated?

The method used is the first method mentioned under basic usage in the README.

Screenshot_20231027-143115.png

Is that method no longer used?

no - that method is outdated and unreliable. A more reliable method is to combine nominatim access + geograpy with details about countries and regions in combination with a "voting" mechanism. See .e.g the code below

using

locations=self.locationContext.locateLocation(locationText)
'''
Created on 2021-08-11

@author: wf
'''
#from lodstorage.entity import EntityManager
from geograpy.locator import LocationContext
from geograpy.nominatim import NominatimWrapper
import sys

class LocationLookup:
    '''
    lookup locations
    '''
    predefinedLocations={}
    
    @classmethod
    def initPredefinedLocations(cls):
        locMap={
            "Not Known": None,
            "Online": None,
            "Virtual": None,
            "Virtual, USA": None,
            "Virtual Event, USA": None,
            "Amsterdam": "Q727",
            "Amsterdam, Amsterdam": "Q727",
            "Amsterdam Netherlands": "Q727",
            "Amsterdam, Netherlands": "Q727",
            "Amsterdam, The Netherlands":"Q727",
            "Amsterdam The Netherlands": "Q727",
            "Będlewo, Poland": "Q5005546",
            "Bergen, Norway":"Q26793",
            "Bremen, Germany": "Q24879",
            "Brussels, Belgium": "Q239",
            "Brussels Belgium": "Q239",
            "Cancun, Mexico":"Q8969",
            "Cancún, Mexico": "Q8969",
            "Gdansk, Poland":"Q1792",
            "Heraklion, Crete, Greece":"Q160544",
            "Красноярск": "Q919",
            "Luxembourg, Luxembourg":"Q1842",
            "Macau, Macau, China":"Q14773",
            "Marina del Rey, CA": "Q988140",
            "Monterrey, Mexico":"Q81033",
            "Москва": "Q649",
            "New Delhi": "Q987",
            "New Delhi, India": "Q987",
            "Новосибирск":"Q883",
            "Salamanca, Spain": "Q15695",
            "Skovde, Sweden": "Q21166",
            "St. Petersburg": "Q656",
            "Санкт-Петербург": "Q656",
            "Saint-Petersburg, Russia":"Q656",
            "Thessaloniki": "Q17151",
            "Thessaloniki, Greece": "Q17151",
            "Trondheim, Norway":"Q25804",
            "Valencia": "Q8818",
            "Valencia, Spain": "Q8818",  
            "Valencia, Valencia, Spain": "Q8818",  
            "York, UK":"Q42462"
        }
        for city,region,regionCode,countryCode,country,wikiDataId in [
            ("Albuquerque","New Mexico","NM","USA","United States","Q34804"),
            ("Alexandria","Virginia","VA","USA","United States","Q88"),
            ("Cambridge",None,None,"UK","United Kingdom","Q21713103"),
            ("Cambridge","Massachusetts","MA","USA","United States","Q49111"),
            ("Charleston","South Carolina","SC","USA","United States","Q47716"),
            ("Lake Louise","Alberta","AB","CA","Canada","Q12826048"),
            ("Los Angeles","California","CA","USA","United States","Q65"),
            ("Miami Beach","Florida", "FL", "USA","United States","Q201516"),
            ("Montreal","Quebec","QC","CA","Canada","Q340"),
            ("Montréal","Quebec","QC","CA","Canada","Q340"),
            ("New Brunswick","New Jersey","NJ","USA","United States","Q138338"),
            ("New Port Beach","California","CA","USA","United States","Q268873"),
            ("Newport Beach","California","CA","USA","United States","Q268873"),
            ("New Orleans","Louisiana","LA","USA","United States","Q34404"),
            ("New York","New York","NY","USA","United States", "Q60"),
            ("Palo Alto","California","CA","USA","United States","Q47265"),
            ("Pasadena","California","CA","USA","United States","Q485176"),
            ("Phoenix","Arizona","AZ","USA","United States", "Q16556"),
            ("San Diego", "California","CA","USA","United States", "Q16552"),
            ("San Francisco","California","CA","USA","United States", "Q62"),
            ("San Jose","California","CA","USA","United States", "Q16553"),
            ("Santa Barbara","California","CA","USA","United States", "Q159288"),
            ("Santa Fe","New Mexico","NM","USA","United States","Q38555"),
            ("Snowbird","Utah","UT","USA","United States","Q3487194"),
            ("St. Louis","Missouri","MO","USA","United States", "Q38022"),
            ("Toronto","Ontario","ON","CA","Canada", "Q172"),
            ("Waikiki, Honolulu","Hawaii","HI","USA","United States","Q254861")
        ]:
            terms=[
                f"{city}, {country}",
                f"{city}, {countryCode}"
            ]
            if region is not None:
                terms.extend([
                    f"{city}, {region}",
                    f"{city} {region}",
                    f"{city}, {regionCode}",
                    f"{city} {regionCode}",
                    f"{city}, {region}, {country}",
                    f"{city} {region} {country}"
                    f"{city}, {region}, {countryCode}",
                    f"{city} {region} {countryCode}",
                    f"{city}, {regionCode}, {country}",
                    f"{city} {regionCode} {country}",
                    f"{city}, {regionCode}, {countryCode}",
                    f"{city} {regionCode} {countryCode}"
                ])
            for term in terms:
                locMap[term]=wikiDataId
        cls.preDefinedLocations=locMap
        cls.other={
            "Washington, DC, USA": "Q61",
            "Bangalore": "Q1355",
            "Bangalore, India": "Q1355",
            "Xi'an": "Q5826",
            "Xi'an, China": "Q5826",
            "Virtual Event USA": "Q30",
            "Virtual USA": "Q30",
            "London United Kingdom": "Q84",
            "Brno":"Q14960",
            "Cancun":"Q8969",
            "Gothenburg Sweden": "Q25287",
            
            "Zurich, Switzerland": "Q72",
            "Barcelona Spain": "Q1492",
            "Vienna Austria": "Q1741",
            "Seoul Republic of Korea": "Q8684",
            "Seattle WA USA": "Q5083",
            "Singapore Singapore":"Q334",
            "Tokyo Japan": "Q1490",
            "Vancouver BC Canada": "Q24639",
            "Vancouver British Columbia Canada": "Q24639",
            "Paris France": "Q90",
            "Nagoya": "Q11751",
            "Marrakech":"Q101625",
            "Austin Texas":"Q16559",
            "Chicago IL USA":"Q1297",
            "Bangkok Thailand":"Q1861",
            "Firenze, Italy":"Q2044",
            "Florence Italy":"Q2044",
            "Timisoara":"Q83404",
            "Langkawi":"Q273303",
            "Beijing China":"Q956",
            "Berlin Germany": "Q64",
            "Prague Czech Republic":"Q1085",
            "Portland Oregon USA":"Q6106",
            "Portland OR USA":"Q6106",
            "Pittsburgh PA USA":"Q1342",
            "Новосибирск":"Q883",
            "Los Angeles CA USA":"Q65",
            "Kyoto Japan": "Q34600"
        }
        cls.predefinedLocations=locMap
    

    def __init__(self):
        '''
        Constructor
        '''
        LocationLookup.initPredefinedLocations()
        self.locationContext=LocationContext.fromCache()
        cacheRootDir=LocationContext.getDefaultConfig().cacheRootDir
        cacheDir=f"{cacheRootDir}/.nominatim"
        self.nominatimWrapper=NominatimWrapper(cacheDir=cacheDir)
        
    def getCityByWikiDataId(self,wikidataID:str):
        '''
        get the city for the given wikidataID
        '''
        citiesGen=self.locationContext.cityManager.getLocationsByWikidataId(wikidataID)
        if citiesGen is not None:
            cities=list(citiesGen)
            if len(cities)>0:
                return cities[0]
        else:
            return None
        
        
    def lookupNominatim(self,locationText:str):
        '''
        lookup the location for the given locationText (if any)
        
        Args:
            locationText(str): the location text to search for
            
        Return:
            the location by first finding the wikidata id for the location text and then looking up the location
        '''
        location=None
        wikidataId=self.nominatimWrapper.lookupWikiDataId(locationText)
        if wikidataId is not None:
            location=self.getCityByWikiDataId(wikidataId)
        return location
        
    def lookup(self,locationText:str,logFile=sys.stdout):
        '''
        lookup a location based on the given locationText
        
        Args:
            locationText(str): the location to lookup
            logFile: the logFile to use - default is sys.stdout
        '''
        if locationText in LocationLookup.preDefinedLocations:
            locationId=LocationLookup.preDefinedLocations[locationText]
            if locationId is None:
                return None
            else:
                location=self.getCityByWikiDataId(locationId)
                if location is None:
                    print(f"❌❌-predefinedLocation {locationText}{locationId} wikidataId not resolved",file=logFile)
                return location
        lg=self.lookupGeograpy(locationText)
        ln=self.lookupNominatim(locationText)
        if ln is not None and lg is not None and not ln.wikidataid==lg.wikidataid:
            print(f"❌❌{locationText}{lg}!={ln}",file=logFile)
            return None
        return lg
        
    def lookupGeograpy(self,locationText:str):
        '''
        lookup the given location by the given locationText
        '''
        locations=self.locationContext.locateLocation(locationText)
        if len(locations)>0:
            return locations[0]
        else:
            return None
        

Hi, I'm not sure I follow the example provided. For example, what are those locMap, and others list provided?

The predefinedLocations are completly optional and try to fix cases that were not properly solved by the voting process at the time of creation of the code. The key is to ask to different services and find out whether they agree. You might want to have a more blackbox view on the code although i provided all the details.

Location Lookup Overview

The LocationLookup class is designed to handle the complexities of geographical location determination. It leverages predefined mappings and external services to ensure accurate and reliable location data.

Class Diagram

Location Lookup Class Diagram
Plantuml source code

Diagram Description

  • LocationLookup: Main class that provides methods to initialize location settings, perform location lookups using predefined data, and integrate results from different geolocation services.
  • LocationContext: Provides context-specific configuration and caching mechanisms.
  • NominatimWrapper: Interfaces with the Nominatim API to fetch geographical data based on location text.

Core Functionalities

  • initPredefinedLocations(): Sets up predefined location data to handle known edge cases.
  • lookup(): Main method to perform location lookups. It cross-references multiple data sources to confirm location accuracy.
  • lookupNominatim() and lookupGeograpy(): These methods are used to fetch location data from specific APIs and compare their results for consistency and accuracy.

The diagram illustrates how LocationLookup interacts with LocationContext and NominatimWrapper to provide comprehensive location resolution services. It is designed to handle ambiguities and discrepancies in location data retrieval, ensuring high reliability.

Maybe not a black box view of the code, but this class is not in the documentation and the README shows an example using:

import geograpy
url = 'https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay'
places = geograpy.get_geoPlace_context(url=url)

as previously noted. This is quite different from having to create a class, call two distinct functions, and then compare their respective outputs. If there was a complete example, to extract locations from a website url or a string that would be very helpful.