main category is messy

Question

wanghaisheng opened this issue 5 months ago · comments

import json
import pandas as pd
df = pd.DataFrame()

# file = # e.g., "meta_All_Beauty.jsonl", downloaded from the `meta` link above
file='D:\\360downloads\\meta_Health_and_Household.jsonl'
filename='meta_Health_and_Household'
asinlist=[]
counts=0
# {
#   "main_category": "All Beauty",
#   "title": "Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)",
#   "average_rating": 4.8,
#   "rating_number": 10,
#   "features": [],
#   "description": [],
#   "price": null,
#   "images": [
#     {
#       "thumb": "https://m.media-amazon.com/images/I/41qfjSfqNyL._SS40_.jpg",
#       "large": "https://m.media-amazon.com/images/I/41qfjSfqNyL.jpg",
#       "variant": "MAIN",
#       "hi_res": null
#     },
#     {
#       "thumb": "https://m.media-amazon.com/images/I/41w2yznfuZL._SS40_.jpg",
#       "large": "https://m.media-amazon.com/images/I/41w2yznfuZL.jpg",
#       "variant": "PT01",
#       "hi_res": "https://m.media-amazon.com/images/I/71i77AuI9xL._SL1500_.jpg"
#     }
#   ],
#   "videos": [],
#   "store": "Howard Products",
#   "categories": [],
#   "details": {
#     "Package Dimensions": "7.1 x 5.5 x 3 inches; 2.38 Pounds",
#     "UPC": "617390882781"
#   },
#   "parent_asin": "B01CUPMQZE",
#   "bought_together": null
# }

# Each line of the .jsonl file is one JSON object describing an item.
with open(file, 'r', encoding='utf-8') as fp:
    for line in fp:
        counts += 1
        listing = json.loads(line.strip())
        r = [
            listing['parent_asin'],
            listing['main_category'],
            listing['title'],
            ';'.join(listing['features']),
            ';'.join(listing['description']),
            listing['price'],
            listing['rating_number'],
            listing['store'],
        ]
        # "Package Dimensions" looks like "7.1 x 5.5 x 3 inches; 2.38 Pounds":
        # the size comes before the ';' and the weight after it.
        dims = listing['details'].get('Package Dimensions', '')
        if ';' in dims:
            size, weight = dims.split(';')[0], dims.split(';')[-1]
        else:
            size, weight = dims, ''
        # Always append both fields so that every row has 10 columns.
        r.append(size)
        r.append(weight)

        asinlist.append(r)
        # if keyword in listing['title']:
        # #     print(listing['parent_asin'])
        #     r=[]
        #     r.append(listing['parent_asin'])
        #     r.append(listing['title'])
        #     asinlist.append(r)
print(counts)
df = pd.DataFrame(asinlist, columns=["Asin", "main_category", "title",
                                     "features", "description",
                                     "price", "rating_number", "store", "size", "weight"])

# keyword = 'nootropics'
keyword = 'fda'
if keyword:
    out = df[df['description'].str.contains(keyword)]
else:
    out = df
    keyword = filename
out.to_csv(keyword + '.csv')

I want to keep the items whose title or description contains 'fda'. As you can see in the output, although the input file is meta_Health_and_Household, the main_category values come from many other categories at the same level as Health and Household:
fda.csv

I cannot understand this.
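
For reference, the filter described here (title or description containing the keyword) could be sketched per record as below. This is only an illustration: `mentions_keyword` is a hypothetical helper, and ignoring case is an assumption so that 'fda' also matches 'FDA'.

```python
def mentions_keyword(listing, keyword):
    """Return True if `keyword` occurs in the title or description, ignoring case."""
    # `title` may be missing or null; `description` is a list of strings.
    title = listing.get("title") or ""
    description = " ".join(listing.get("description") or [])
    needle = keyword.lower()
    return needle in title.lower() or needle in description.lower()


# Two made-up records in the same shape as the metadata lines.
sample = [
    {"title": "FDA registered thermometer", "description": []},
    {"title": "Reading glasses", "description": ["Lightweight frame"]},
]
hits = [item["title"] for item in sample if mentions_keyword(item, "fda")]
print(hits)
```

Inside the reading loop, the same check could be applied to `listing` before the row is appended to `asinlist`.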

Yupeng Hou · Answer 1 · Thu Apr 25 2024 15:30:38 GMT+0800 (China Standard Time)

Thanks for pointing out this issue! We can reproduce it. I'll check it out and get back to you soon.

Yupeng Hou · Answer 2 · Fri Apr 26 2024 02:30:44 GMT+0800 (China Standard Time)

Hi, I guess the following figure explains most of the points:

This is an actual item with parent_asin=B007I8S9ZK in the Health_and_Household domain, with the main_category='Video Games' and categories=['Health & Household', 'Vision', 'Reading Glasses']. [link]

We assign items to domains mainly by the first entry of the categories attribute. Only when the categories attribute is None do we use main_category to decide which domain an item belongs to.
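
The rule above can be sketched roughly as follows. This is not the actual collection code, which has not been released: `assign_domain` is a hypothetical name, and treating an empty categories list the same way as None is an assumption.

```python
def assign_domain(item):
    """Pick a domain label for one metadata record (illustrative only)."""
    categories = item.get("categories")
    # Assumption: an empty list is handled like None.
    if categories:
        return categories[0]
    # Fall back to main_category only when there are no categories.
    return item.get("main_category")


# The item discussed above: main_category and categories disagree.
glasses = {
    "parent_asin": "B007I8S9ZK",
    "main_category": "Video Games",
    "categories": ["Health & Household", "Vision", "Reading Glasses"],
}
print(assign_domain(glasses))
```

Under this rule the item lands in Health & Household even though its main_category says 'Video Games', which is why the main_category column looks mixed within a single domain file.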

Since we do not know how Amazon sets the main_category and categories of an item, we keep both unchanged in the released dataset.
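
To see how often the two fields disagree in a given file, one could tabulate them side by side. The sketch below is an assumption about how to do that, not part of the dataset tooling; `label_pairs` is a hypothetical helper and the sample lines are made up apart from the B007I8S9ZK values quoted above.

```python
import json
from collections import Counter


def label_pairs(lines):
    """Count (first category, main_category) pairs over JSONL lines."""
    pairs = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        categories = item.get("categories") or []
        first = categories[0] if categories else None
        pairs[(first, item.get("main_category"))] += 1
    return pairs


sample_lines = [
    json.dumps({"main_category": "Video Games",
                "categories": ["Health & Household", "Vision", "Reading Glasses"]}),
    json.dumps({"main_category": "Health & Household", "categories": []}),
]
for (first, main), n in label_pairs(sample_lines).most_common():
    print(first, "|", main, "|", n)
```

With a real file, the open file handle can be passed directly in place of `sample_lines`.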

HeisenBerg? · Answer 3 · Fri Apr 26 2024 04:44:39 GMT+0800 (China Standard Time)

@hyp1231 It seems this logic is embedded in the collection script. Does this dataset release any data collection scripts?

Yupeng Hou · Answer 4 · Fri Apr 26 2024 06:05:38 GMT+0800 (China Standard Time)

@hyp1231 It seems this logic is embedded in the collection script. Does this dataset release any data collection scripts?

For now, we do not have plans to release data collection scripts.