Writing Node crawler data to MongoDB | Joyde·zhong's blog

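Instead of opening the JSON file and importing it into the database every time, wouldn't it be more convenient to write the scraped data straight into the database and use it directly during development? This post shows one way to do that.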

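This program adds the mongoose package to the dependencies used in the previous post, "A simple crawler program". Create a Node project and install the packages yourself; I recommend using WebStorm to create a Node.js Express App directly.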

For more of what mongoose can do with MongoDB, see the official documentation: https://mongoosejs.com/

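How mongoose works with MongoDB: you define a Schema that describes the shape of the documents in a collection, compile it into a model, and then read and write the data through the methods mongoose provides on that model.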

1. Import the dependencies

// Import the packages
const http = require("http");
const path = require("path");
const url = require("url");
const fs = require("fs");

const mongoose = require("mongoose");
const superagent = require("superagent");
const cheerio = require("cheerio");

2. Connect to the database and define the data model

I recommend Studio 3T as a GUI tool for MongoDB. For installing MongoDB itself, refer to the guides available online.

// Connect to the local MongoDB database DBBook
var mongourl = 'mongodb://localhost/DBBook';
mongoose.connect(mongourl);
var Schema = mongoose.Schema;
// Define the schema; the model is compiled from it below
var newBookSchema = new Schema({
    title: String,
    bookId: Number,
    grade: Number,
    bookInfo: String,
    bookImg: String,
    description: String
});
var NewBook = mongoose.model('NewBook', newBookSchema, 'newbooks');
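One thing the model above does not do by itself is prevent duplicates: every document saved through `NewBook` is a new insert, so running a crawler twice would store each book twice. A minimal sketch of one possible workaround follows, assuming `bookId` identifies a book uniquely. The helper name `toUpsertOps` and the sample data are hypothetical; the operation objects follow the `updateOne` shape that mongoose's `Model.bulkWrite` accepts.

```javascript
// Hypothetical helper: turn plain book objects into upsert operations
// keyed on bookId, in the shape expected by Model.bulkWrite().
function toUpsertOps(books) {
    return books.map(function (book) {
        return {
            updateOne: {
                filter: { bookId: book.bookId }, // match an existing book by id
                update: { $set: book },          // overwrite its fields
                upsert: true                     // insert when nothing matches
            }
        };
    });
}

// Made-up sample data, for illustration only.
var sampleBooks = [
    { title: "Sample A", bookId: 1, grade: 8.5 },
    { title: "Sample B", bookId: 2, grade: 0 }
];
var ops = toUpsertOps(sampleBooks);
console.log(JSON.stringify(ops[0]));

// With a live connection, the operations could then be sent in one call:
// NewBook.bulkWrite(ops).then(function (res) { console.log(res); });
```

Building the operations as plain objects keeps this part easy to check without a running database; only the commented-out `bulkWrite` call needs a connection.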

3. Fetch the page with superagent's get and end methods

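get(): takes the URL to fetch the data from
end(): sends the request; the first argument of its callback is the error object, and the second is the response object, whose text property holds the full HTML of the page requested in get()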

superagent
    .get("https://book.douban.com/latest?icn=index-latestbook-all")
    .end((error, response)=>{
    // ......
    })
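The callback above receives both an error and a response, and it is worth checking both before parsing anything. The sketch below shows one way to do that. `pageTextOrNull` is a hypothetical helper, not part of superagent, and the objects passed to it at the end are stand-ins shaped like the callback arguments.

```javascript
// Hypothetical helper (not part of superagent): decide whether the
// arguments passed to the end() callback contain a usable HTML page.
function pageTextOrNull(error, response) {
    // superagent passes a non-null error for network failures and,
    // by default, for 4xx/5xx responses.
    if (error) {
        console.error("Request failed: " + error.message);
        return null;
    }
    // response.text holds the unparsed body; treat a missing or empty one as unusable.
    if (!response || typeof response.text !== "string" || response.text.length === 0) {
        return null;
    }
    return response.text;
}

// Stand-in objects shaped like the callback arguments, for illustration only.
console.log(pageTextOrNull(null, { text: "<html></html>" }));
console.log(pageTextOrNull(new Error("timeout"), undefined));
```

Inside the real callback this could be used as `var content = pageTextOrNull(error, response); if (content === null) { return; }` before handing the text to cheerio.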

4. Analyze the structure and extract the data

// Get the page's HTML (this code runs inside the end() callback from step 3)
var content = response.text;

// cheerio is essentially jQuery for Node: it wraps the whole document in a collection, which we assign to $
var $ = cheerio.load(content);

// Define an empty array to collect the data
var result = [];

// Walk the document structure: select every li, then read its contents (each li holds the data for one book)
$(".cover-col-4 li").each(function(index,value){

    // Extract the id from the URL
    var address = $(value).find(".cover").attr("href");
    var bookId = address.replace(/[^0-9]/ig,"");

    var gradeStr = $(value).find(".detail-frame .rating .font-small").text().replace(/\ +/g,"").replace(/[\r\n]/g,"");

    // Add the extracted data to the array as an object
    var oneBook = {
        title: $(value).find(".detail-frame h2 a").text(),
        bookId: bookId,
        grade: Number(gradeStr) ? Number(gradeStr) : 0,
        bookInfo: $(value).find(".detail-frame .color-gray").text().replace(/\ +/g,"").replace(/[\r\n]/g,""),
        bookImg: $(value).find(".cover img").attr("src").replace(/^https:/g,""),
        description: $(value).find(".detail-frame .detail").text().replace(/\ +/g,"").replace(/[\r\n]/g,"")
    };
    result.push(oneBook);
    
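    // Create a NewBook model instance from this book's data
    var newBook = new NewBook(oneBook);

    // Save to MongoDB; save() returns a promise (callback support was removed in Mongoose 7)
    newBook.save()
        .then(function(){
            console.log("OK!");
        })
        .catch(function(err){
            console.log('Save failed: ' + err);
        });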
});
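The loop above repeats the same chain of `replace()` calls for several fields and pulls the id out of the link by deleting every non-digit character. As a sketch of how that could be factored out, the helpers below (all hypothetical names) wrap the cleaning, id extraction, and rating fallback. `extractBookId` assumes the links look like `https://book.douban.com/subject/<digits>/`; the sample URL is made up for illustration.

```javascript
// Remove spaces and line breaks, as the chained replace() calls above do.
function cleanText(text) {
    return text.replace(/ +/g, "").replace(/[\r\n]/g, "");
}

// Assumed URL shape: https://book.douban.com/subject/<digits>/
// Returns the id as a number, or null when the pattern is absent.
function extractBookId(href) {
    var match = /\/subject\/(\d+)/.exec(href || "");
    return match ? Number(match[1]) : null;
}

// Non-numeric or empty ratings become 0, mirroring the original fallback.
function parseGrade(gradeStr) {
    var grade = Number(gradeStr);
    return grade ? grade : 0;
}

// Illustration with made-up inputs.
console.log(cleanText("  Some  text\r\n here "));
console.log(extractBookId("https://book.douban.com/subject/12345678/"));
console.log(parseGrade(""));
```

Matching on `/subject/` instead of stripping non-digits avoids picking up stray digits from other parts of the URL, such as a query string, and the `null` result also covers an `li` that has no `.cover` link.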

For the complete code, see: https://github.com/joydezhong/SimpleCrawler/blob/master/crawler.js