xiechuyu / untitled_coursedata2

Course Data Collection Project


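I. Project Background

1) Collect information on all courses from NetEase Cloud Classroom (网易云课堂) and Tencent Classroom (腾讯课堂).
2) Fields to collect for each course: course name, course author, course rating, number of learners, and course price.
3) Output format: Excel.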

II. Data Collection by Site

A. NetEase Cloud Classroom:

1) The data collection job for NetEase Cloud Classroom is in the study_163 directory.
2) Site analysis:

a. Request URL: private static final String URL="https://study.163.com/p/search/studycourse.json";
b. The POST request body is a JSON string;
c. This project is built on Jsoup; to send a JSON body in a Jsoup POST request, call .requestBody(strJson);
d. Request parameters in JSON format: {"pageIndex":1,"pageSize":50,"relativeOffset":0,"frontCategoryId":-1,"searchTimeType":-1,"orderType":50,"priceType":-1,"activityId":0,"keyword":""}
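
Between pages, only pageIndex and relativeOffset change in the request parameters above. The following is a minimal sketch of building the body for a given page. The names `RequestBodySketch` and `buildBody` are hypothetical, and the relation relativeOffset = (pageIndex - 1) * pageSize is inferred from the pagination loop in this project's code, not from any documented API of the site.

```java
import java.util.Locale;

public class RequestBodySketch {
    // Hypothetical helper: builds the JSON request body for one result page.
    // Assumption: relativeOffset = (pageIndex - 1) * pageSize, as the project's loop implies.
    public static String buildBody(int pageIndex, int pageSize) {
        int relativeOffset = (pageIndex - 1) * pageSize;
        return String.format(Locale.ROOT,
                "{\"pageIndex\":%d,\"pageSize\":%d,\"relativeOffset\":%d,"
                + "\"frontCategoryId\":-1,\"searchTimeType\":-1,\"orderType\":50,"
                + "\"priceType\":-1,\"activityId\":0,\"keyword\":\"\"}",
                pageIndex, pageSize, relativeOffset);
    }

    public static void main(String[] args) {
        // Print the bodies for the first two pages of 50 courses each.
        System.out.println(buildBody(1, 50));
        System.out.println(buildBody(2, 50));
    }
}
```

The string returned by `buildBody(1, 50)` is identical to the sample request parameters listed in item d.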

3) Code snippets:

a. Fetching data from NetEase Cloud Classroom:
            connection=Jsoup.connect(NeteaseJsoup.URL);
            connection.userAgent(NeteaseJsoup.USERAGENT);
            // The response is JSON, so skip Jsoup's content-type check
            connection.ignoreContentType(true);
            connection.timeout(10000);
            connection.header("Accept","application/json");
            connection.header("Accept-Encoding","gzip, deflate, br");
            connection.header("Accept-Language","zh-CN,zh;q=0.9");
            connection.header("Connection","keep-alive");
            // No hard-coded Content-Length: the body length changes with the page index
            connection.header("Content-Type","application/json");
            connection.cookie("EDUWEBDEVICE","720485d2d3844caebacda643ee8cdb53");
            // requestBody() sets the raw JSON body of the Jsoup POST request
            connection.requestBody(strJson);
            document=connection.post();
            strDoc=document.text();
            System.out.println(strDoc);
            
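b. Building the JSON string used as the request parameters:
            // Fixed fragments of the request body; the page index and offset are spliced in between
            private String Stra = "{\"pageIndex\":";
            private String Strb = ",\"pageSize\":50,\"relativeOffset\":";
            private String Strc = ",\"frontCategoryId\":-1,\"searchTimeType\":-1,\"orderType\":50,\"priceType\":-1,\"activityId\":0,\"keyword\":\"\"}";

            @Override
            public void run() {
                int j = 0; // 1-based page index
                try {
                    // i is the relative offset: 0, 50, ..., 2950 (60 pages of 50 courses)
                    for (int i = 0; i <= 2950; i = i + 50) {
                        j++;
                        System.out.println(j + "-----------" + i);

                        NeteaseDataStorage.getNeteaseDataStorage(
                                NeteaseJsoup.getNeteaseRun(Stra + j + Strb + i + Strc)
                        );
                    }
                } catch (Exception e) {
                    // Report the failure instead of silently swallowing it
                    e.printStackTrace();
                }
            }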
c. Parsing the returned JSON data:
        // Open the output file once, in append mode, rather than once per record
        try (BufferedWriter bufferedWriter = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("NeteaseJsonData.txt", true)))) {
            while (( strTxt = bufferedReader.readLine() ) != null) {
                Thread.sleep(1000);
                // Each input line is one JSON response; the courses are in result.list
                JSONObject jsonObject = JSONObject.fromObject(strTxt);
                JSONObject jsonObject1 = jsonObject.getJSONObject("result");
                JSONArray jsonArray = jsonObject1.getJSONArray("list");
                for (int i = 0; i < jsonArray.size(); i++) {
                    JSONObject jsonObject2 = jsonArray.getJSONObject(i);
                    String productName = jsonObject2.getString("productName");
                    String provider = jsonObject2.getString("provider");
                    String score = jsonObject2.getString("score");
                    String learnerCount = jsonObject2.getString("learnerCount");
                    String originalPrice = jsonObject2.getString("originalPrice");

                    // One pipe-delimited line per course
                    bufferedWriter.write(productName + "|" + provider + "|" + score + "|" + learnerCount + "|" + originalPrice + "\r\n");
                }
            }
        }
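
The parsing step above ends by appending each course as one pipe-delimited line. The following is a minimal sketch of that output step in isolation, assuming the five fields have already been extracted as strings. The names `CourseLineWriter`, `formatLine`, and `appendAll` are hypothetical, and UTF-8 is an assumed encoding (the project's code relies on the platform default).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class CourseLineWriter {
    // Joins the fields of one course with "|" in the order they are given.
    public static String formatLine(String... fields) {
        return String.join("|", fields);
    }

    // Appends one line per course, opening the file only once.
    // Assumption: UTF-8 output; the original code uses the platform default charset.
    public static void appendAll(Path file, List<String[]> courses) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String[] course : courses) {
                writer.write(formatLine(course));
                writer.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Write to a temporary file so the sketch does not touch NeteaseJsonData.txt.
        Path out = Files.createTempFile("NeteaseJsonData", ".txt");
        List<String[]> courses = new ArrayList<>();
        courses.add(new String[] {"AE教程超级合辑【影视后期必学】", "设计软件通", "5", "2154", "699"});
        appendAll(out, courses);
        System.out.print(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

The try-with-resources block closes the writer even when a write fails, so no explicit close() calls are needed.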

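4) Data format:

| Course name | Course author | Course rating | Learners | Course price |
| --- | --- | --- | --- | --- |
| AE教程超级合辑【影视后期必学】 | 设计软件通 | 5 | 2154 | 699 |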

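B. Tencent Classroom:

1) The data collection job for Tencent Classroom is in the study_qq directory.
2) Site analysis:

a. The page is parsed directly with Jsoup's Elements class to obtain the course name, course author, number of learners, and price. Sample code: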
    public static String getQQRun(String URL){
        int i=0;
        String price="";
        String data="";
        String num="";
        try{

            connection=Jsoup.connect(URL);
            connection.userAgent(USERAGENT);
            connection.ignoreContentType(true);
            connection.timeout(10000);
            // Send request headers only: content-encoding and server are response headers,
            // and a GET without a body needs no content-type
            connection.header("accept-encoding","gzip, deflate, br");
            connection.header("accept-language","zh-CN,zh;q=0.9");
            connection.header("accept","text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
            document=connection.get();
//            System.out.println(document.html());
            elements=document.select("[class=course-card-list]");
            elements1=elements.select("[class=course-card-item]");
            while (i<elements1.size()){
                if(i==24){
                    break;
                }
//                System.out.println(elements1.get(i).html());
                elements2=elements1.get(i).select("[class=item-tt]");
                elements3=elements2.select("[class=item-tt-link]");
                //System.out.println(elements3.html());// course name


                elements4=elements1.get(i).select("[class=item-line item-line--middle]");
                elements5=elements4.select("[class=item-source-link]");
                //System.out.println(elements5.html());// course author


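                elements6=elements4.select("[class=line-cell item-user]");
                //System.out.println(elements6.html());// number of learners
                // Reset so a card without digits does not inherit the previous card's count
                num="";
                // \d+ (not \d*) so the matcher never yields empty matches
                String regex = "\\d+";
                Pattern p = Pattern.compile(regex);
                Matcher m = p.matcher(elements6.html());
                while (m.find()) {
                    //System.out.println(m.group());// number of learners
                    num=m.group(); // keep the last number found
                }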


                elements7=elements1.get(i).select("[class=item-line item-line--bottom]");
                elements8=elements7.select("[class=line-cell item-price free]");
                //System.out.println(elements8.html());// course price (free)
                elements9=elements7.select("[class=line-cell item-price]");
                //System.out.println(elements9.html());// course price (paid)
                price=elements8.html()+elements9.html();

                data=data+elements3.html()+"|"+elements5.html()+"|"+"null"+"|"+num+"|"+price+"\r\n";

                i++;

            }
            System.out.println(data);

        }catch(Exception e){
            // Log the failure; whatever was collected so far is still returned
            e.printStackTrace();
        }

        return data;
    }
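
The method above takes the learner count as the last run of digits in the matched element's HTML. The following is a minimal sketch of that extraction on its own, using `\d+` so that empty matches cannot occur. The names `LearnerCountSketch` and `lastNumber` are hypothetical, and the sample input is made up for illustration, not copied from the live page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LearnerCountSketch {
    // One or more digits; unlike \d*, this never matches the empty string.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the last run of digits in the text, or "" when there is none.
    public static String lastNumber(String html) {
        String last = "";
        Matcher m = DIGITS.matcher(html);
        while (m.find()) {
            last = m.group();
        }
        return last;
    }

    public static void main(String[] args) {
        // Hypothetical element content, for illustration only.
        System.out.println(lastNumber("461 learners enrolled"));
    }
}
```

Returning an empty string when no digits are found keeps a missing count from being confused with the count of a previously processed course.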


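b. The collected data is stored through the QQDataStorage class; it is written to QQData.txt in the root directory.
c. The QQDataFile class loops over the Tencent Classroom course list pages. Sample code: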
            for (int i = 1; i < 35; i++) {
                Thread.sleep(10000);
                QQDataStorage.getQQDataStorage(QQJsoup.getQQRun("https://ke.qq.com/course/list?mt=1002&task_filter=0000000&&page=" + i));
            }

Note: Tencent Classroom does not provide course rating data, so the rating field is written as the literal string null.

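3) Data format:

| Course name | Course author | Course rating | Learners | Course price |
| --- | --- | --- | --- | --- |
| UG基础到产品设计 | 东湖教育 | null | 461 | 0 |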
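
The project background names Excel as the output format, while the jobs shown here write pipe-delimited text files. One possible bridge, offered as an assumption and not as the project's actual export step, is to convert each line to CSV, which Excel can open. The names `PipeToCsvSketch`, `escape`, and `toCsvLine` are hypothetical.

```java
public class PipeToCsvSketch {
    // Quotes a field when it contains a comma, a double quote, or a line break,
    // doubling any embedded double quotes (the usual CSV convention).
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Converts one pipe-delimited record into one CSV record.
    // Limitation: a course name that itself contains "|" would be split incorrectly.
    public static String toCsvLine(String pipeLine) {
        String[] fields = pipeLine.split("\\|", -1); // -1 keeps trailing empty fields
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine("UG基础到产品设计|东湖教育|null|461|0"));
    }
}
```

Excel may need the file saved with a byte order mark or imported with an explicit encoding to display Chinese text correctly; that step is outside this sketch.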
