Scraping Maoyan Movie Comment Data

Posted by Zss:

A WeChat official account I follow posted about scraping Maoyan movie comments, and I found it interesting that the data was finally visualized as a chart with pyecharts.

On the desktop version of the Maoyan site, only the ten hottest comments are shown, and I could not find the full list anywhere. The mobile site, however, has all of the comments, so I used Fiddler to capture the phone's HTTP traffic and found the comment data API:

'http://m.maoyan.com/mmdb/comments/movie/%s.json?_v_=yes&offset=0&startTime='%movie_id + start_time.replace(' ', '%20')

In the URL, movie_id is the film's ID and startTime is the starting timestamp. The plan is to grab all the comments first and process them into charts later. To make the records easy to split afterwards, the fields are joined with the separator -+-.
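As a small sketch, the request URL can be assembled like this (the movie ID and timestamp shown are placeholder values, not from the original post):

```python
# Build the comment-API URL; the space in the timestamp must be
# percent-encoded as %20 so it survives in the query string.
def build_url(movie_id, start_time):
    base = 'http://m.maoyan.com/mmdb/comments/movie/%s.json?_v_=yes&offset=0&startTime=' % movie_id
    return base + start_time.replace(' ', '%20')

print(build_url('1203084', '2018-08-10 12:00:00'))
```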

Every movie has a different number of comments, so the total scraping time varies; grabbing 110,000 comments took about 2.5 hours. Scraping walks backwards from the current time until it reaches the film's release date.
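The backward walk works by taking the timestamp of the last comment in each batch and subtracting one second to form the next startTime, so the same comment is not fetched twice. A minimal illustration of that step:

```python
from datetime import datetime, timedelta

def previous_start_time(last_comment_time):
    # Parse the last comment's timestamp, step back 1 second, and
    # re-format it as the startTime for the next request.
    t = datetime.strptime(last_comment_time, '%Y-%m-%d %H:%M:%S')
    return (t - timedelta(seconds=1)).strftime('%Y-%m-%d %H:%M:%S')

print(previous_start_time('2018-08-10 12:00:00'))
```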

Finally, I found that when an error occurs, some records end up written to the file more than once, so a last pass with set() is needed to remove the duplicates.
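The deduplication pass is not in the script below, but it can be sketched like this: read the lines, keep a set of lines already seen, and write out only the first occurrence of each (the function name is my own, not from the original post):

```python
def dedupe_lines(lines):
    # Drop exact duplicate records while preserving first-seen order;
    # a plain set() alone would lose the original ordering.
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result

lines = ['2018-08-10 12:00:00-+-1-+-userA', '2018-08-10 11:59:00-+-2-+-userB', '2018-08-10 12:00:00-+-1-+-userA']
print(dedupe_lines(lines))
```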

#coding:utf-8
import logging,time,requests,json,os,sys
logging.basicConfig(level=logging.INFO)
from datetime import datetime
from datetime import timedelta
reload(sys)
sys.setdefaultencoding('utf-8')


class Comment():
    def __init__(self):
        self.headers = {'Host':'m.maoyan.com',
                        'Connection':'keep-alive',
                        'User-Agent':'Mozilla/5.0 (Linux; Android 7.1.2; MI 6 Build/NJH47F; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/60.0.3112.78 Mobile Safari/537.36',
                        'X-Requested-With':'superagent',
                        'Accept':'*/*',
                        'Accept-Encoding':'gzip, deflate',
                        'Accept-Language':'zh-CN,en-US;q=0.8'}
        self.url_list = []

    def get_data(self,url):
        rsp = requests.get(url, headers=self.headers)
        if rsp.status_code == 200:
            return rsp.content
        return None

    def parse_data(self,html):
        data = json.loads(html)['cmts']  # parse the JSON string
        comments = []
        try:
            for item in data:
                comment = {
                    'id': item['id'],
                    'nickName': item['nickName'],
                    'cityName': item['cityName'] if 'cityName' in item else '',  # cityName may be missing
                    'content': item['content'].replace('\n', ' ', 10),  # flatten newlines in the comment text
                    'score': item['score'],
                    'startTime': item['startTime']
                }
                comments.append(comment)
        except Exception as e:
            print 'Error: %s' % e
        return comments

    def get_comment(self,file_name,movie_id,online_time):
        if os.path.isfile('%s.txt' % file_name):
            os.remove('%s.txt' % file_name)
        start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # start from the current time and walk backwards
        end_time = '%s 00:00:00' % online_time
        while start_time > end_time:
            print 'Current time: %s' % start_time
            url = 'http://m.maoyan.com/mmdb/comments/movie/%s.json?_v_=yes&offset=0&startTime=' % movie_id + start_time.replace(
                ' ', '%20')
            html = None
            try:
                time.sleep(0.1)
                html = self.get_data(url)
            except Exception as e:
                print 'Error: %s' % e
                time.sleep(0.5)
                html = self.get_data(url)
            else:
                time.sleep(0.1)
            comments = self.parse_data(html)
            try:
                start_time = comments[14]['startTime']  # timestamp of the last comment in the batch
                start_time = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S') + timedelta(seconds=-1)  # step back 1 second to avoid fetching duplicates
                start_time = datetime.strftime(start_time, '%Y-%m-%d %H:%M:%S')
            except Exception as e:
                print 'Error: %s' % e
            for i in comments:
                with open('%s.txt' % file_name, 'a+') as f:
                    f.write(i['startTime'] + '-+-' + str(i['id']) + '-+-' + i['nickName'] + '-+-' + i['cityName'] + '-+-' + i['content'] + '-+-' + str(i['score']) + '\n')



if __name__ == '__main__':
    s_time = time.time()
    comment = Comment()
    file_name = raw_input('Enter the output file name: ')
    movie_id = raw_input('Enter the movie ID: ')
    online_time = raw_input('Enter the release date, e.g. 2018-08-10: ')
    comment.get_comment(file_name, movie_id, online_time)
    t_time = time.time() - s_time
    with open('%s.txt' % file_name, 'a+') as f:
        f.write('Total running time: %s' % str(t_time))
    print 'Total running time: %s' % str(t_time)