hyp1231 / AmazonReviews2023

Scripts for processing the Amazon Reviews 2023 dataset; implementations and checkpoints of BLaIR: "Bridging Language and Items for Retrieval and Recommendation".

main category is messy

wanghaisheng opened this issue:

import json
import pandas as pd
df = pd.DataFrame()

# file = # e.g., "meta_All_Beauty.jsonl", downloaded from the `meta` link above
file='D:\\360downloads\\meta_Health_and_Household.jsonl'
filename='meta_Health_and_Household'
asinlist=[]
counts=0
# {
#   "main_category": "All Beauty",
#   "title": "Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)",
#   "average_rating": 4.8,
#   "rating_number": 10,
#   "features": [],
#   "description": [],
#   "price": null,
#   "images": [
#     {
#       "thumb": "https://m.media-amazon.com/images/I/41qfjSfqNyL._SS40_.jpg",
#       "large": "https://m.media-amazon.com/images/I/41qfjSfqNyL.jpg",
#       "variant": "MAIN",
#       "hi_res": null
#     },
#     {
#       "thumb": "https://m.media-amazon.com/images/I/41w2yznfuZL._SS40_.jpg",
#       "large": "https://m.media-amazon.com/images/I/41w2yznfuZL.jpg",
#       "variant": "PT01",
#       "hi_res": "https://m.media-amazon.com/images/I/71i77AuI9xL._SL1500_.jpg"
#     }
#   ],
#   "videos": [],
#   "store": "Howard Products",
#   "categories": [],
#   "details": {
#     "Package Dimensions": "7.1 x 5.5 x 3 inches; 2.38 Pounds",
#     "UPC": "617390882781"
#   },
#   "parent_asin": "B01CUPMQZE",
#   "bought_together": null
# }

with open(file, 'r', encoding='utf-8') as fp:
    for line in fp:
        counts=counts+1
        listing=json.loads(line.strip())
        r=[]
        r.append(listing['parent_asin'])
        r.append(listing['main_category'])

        
        r.append(listing['title'])


        r.append(';'.join(listing['features']))
        r.append(';'.join(listing['description']))
        

        r.append(listing['price'])

        r.append(listing['rating_number'])
        r.append(listing['store'])
        if 'Package Dimensions' in listing['details']:
            if ';' in listing['details']['Package Dimensions']:
                # e.g. "7.1 x 5.5 x 3 inches; 2.38 Pounds" -> size, weight
                size=listing['details']['Package Dimensions'].split(';')[0].strip()
                weight=listing['details']['Package Dimensions'].split(';')[-1].strip()
                r.append(size)
                r.append(weight)
            else:
                r.append(listing['details']['Package Dimensions'])
                r.append('')
        else:
            # no dimensions: fill both size and weight so every row has 10 fields
            r.append('')
            r.append('')

        asinlist.append(r)        
        # if keyword in listing['title']:
        # #     print(listing['parent_asin'])
        #     r=[]
        #     r.append(listing['parent_asin'])
        #     r.append(listing['title'])
        #     asinlist.append(r)
print(counts)
# build the frame directly from the list of rows
df = pd.DataFrame(asinlist, columns=["Asin","main_category","title",
                                     "features","description",
                                     "price","rating_number","store","size","weight"])

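keyword='fda'  # previously tried: 'nootropics'
if keyword:
     # match the keyword in either the title or the description, ignoring case
     mask=(df['title'].str.contains(keyword, case=False, na=False, regex=False)
           | df['description'].str.contains(keyword, case=False, na=False, regex=False))
     out=df[mask]
else:
     out=df
     keyword=filename
out.to_csv(keyword+'.csv')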

I want to keep the rows whose title or description contains 'fda'. As the output shows, although the input file is meta_Health_and_Household, the main_category values come from many other categories at the same level as Health and Household:
fda.csv

I cannot understand this.
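
One quick way to see how mixed the main_category column is would be to count its values. The following is only a minimal sketch, assuming a DataFrame with a main_category column like the df built by the script above; the helper name summarize_main_category and the demo rows are hypothetical.

```python
import pandas as pd

def summarize_main_category(df: pd.DataFrame) -> pd.Series:
    """Count rows per main_category, keeping missing values as their own bucket."""
    return df["main_category"].value_counts(dropna=False)

# Tiny made-up frame standing in for the real metadata (values are illustrative only).
demo = pd.DataFrame({
    "Asin": ["A1", "A2", "A3", "A4"],
    "main_category": ["Health & Household", "Video Games", "Health & Household", None],
})
counts = summarize_main_category(demo)
print(counts)
```

Running the same helper on the real df should show whether the unexpected categories are a handful of outliers or a large share of the file.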

Thanks for pointing out this issue! We can reproduce it. I'll check it out and get back to you soon.

Hi, I guess the following figure explains most of the points:
[Image: WX20240425-111926@2x]

This is an actual item in the Health_and_Household domain with parent_asin=B007I8S9ZK, main_category='Video Games' and categories=['Health & Household', 'Vision', 'Reading Glasses']. [link]

We assign items to domains mainly by the first entry of the categories attribute. Only when the categories attribute is None do we use main_category to decide which domain an item belongs to.
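
The rule described above could be restated roughly as follows. This is a hypothetical sketch, not the collection script (which is not released): the function name assign_domain is made up, and treating an empty categories list the same as None is an assumption, since the reply only mentions None.

```python
from typing import Optional

def assign_domain(item: dict) -> Optional[str]:
    """Hypothetical restatement of the described rule, not the authors' code."""
    categories = item.get("categories")
    # Assumption: an empty list is handled like None (the reply only mentions None).
    if categories:
        return categories[0]
    # Fall back to main_category only when there is no usable categories value.
    return item.get("main_category")

# The item from the reply above: main_category and categories disagree.
item = {
    "parent_asin": "B007I8S9ZK",
    "main_category": "Video Games",
    "categories": ["Health & Household", "Vision", "Reading Glasses"],
}
print(assign_domain(item))
```

Under this reading, the item lands in Health & Household even though its main_category says 'Video Games', which matches what the issue reports.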

In this case, as we also have no idea how Amazon sets the main_category and categories of one item, we just keep them unchanged in the released dataset.
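
Because both fields are released unchanged, a user who needs consistent labels could flag the rows where they disagree. The sketch below is only an illustration under assumptions: it expects a DataFrame that still has the raw categories lists (the script in this issue does not keep them), and the helper name flag_category_mismatch and the demo rows other than B007I8S9ZK are hypothetical.

```python
import pandas as pd

def flag_category_mismatch(df: pd.DataFrame) -> pd.DataFrame:
    """Add first_category and a boolean mismatch column (hypothetical helper)."""
    out = df.copy()
    # First entry of categories, or None when the list is missing or empty.
    out["first_category"] = out["categories"].apply(lambda c: c[0] if c else None)
    # A row is a mismatch only when a first category exists and differs.
    out["mismatch"] = out["first_category"].notna() & (
        out["first_category"] != out["main_category"]
    )
    return out

demo = pd.DataFrame({
    "parent_asin": ["B007I8S9ZK", "X1", "X2"],  # X1 and X2 are made-up IDs
    "main_category": ["Video Games", "Health & Household", "All Beauty"],
    "categories": [
        ["Health & Household", "Vision", "Reading Glasses"],
        ["Health & Household"],
        [],
    ],
})
flagged = flag_category_mismatch(demo)
print(flagged[["parent_asin", "main_category", "first_category", "mismatch"]])
```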

@hyp1231 it seems this logic is embedded in the collection script. Does this dataset release any data collection scripts?

For now, we do not have plans to release data collection scripts.