Python question: using Scrapy to scrape an ASP.NET site with JavaScript buttons and AJAX requests


I am trying to scrape some data from an ASP.NET site; the start page is: http://www.e3050.com/Items.aspx?cat=SON

First, I want to display 50 items per page (via the page-size select element). Second, I want to paginate through all the result pages.

I tried the following code to get 50 items per page, but it didn't work:

<code>start_urls = ["http://www.e3050.com/Items.aspx?cat=SON"]

def parse(self, response):
    requests = []
    hxs = HtmlXPathSelector(response)

    # Check if there's more than 1 page
    if len(hxs.select('//span[@id="ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_PageSize"]/text()').extract()) > 0:
        # Get last page number
        last_page = hxs.select('//span[@id="ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_PageSize"]/text()').extract()[0]
        i = 1

        # preparing requests for each page
        while i < (int(last_page) / 5) + 1:
            requests.append(Request("http://www.e3050.com/Items.aspx?cat=SON", callback=self.parse_product))
            i += 1

        # posting form data (50 items and next page button)
        requests.append(FormRequest.from_response(
            response,
            formdata={'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
                      '__EVENTTARGET': 'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01'},
            callback=self.parse_product,
            dont_click=True
        ))

        for request in requests:
            yield request
</code>
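A likely reason the loop above produces no extra pages: every queued Request points at the exact same URL, and Scrapy's duplicate-request filter drops repeats by default. A quick sketch of what the while loop actually builds (the `last_page` value here is a hypothetical example):

```python
# Sketch of what the while loop above builds, assuming the page-size
# label yielded "100" (hypothetical value, for illustration only).
last_page = "100"

urls = []
i = 1
while i < (int(last_page) / 5) + 1:
    # Every iteration appends the very same URL -- nothing varies with i.
    urls.append("http://www.e3050.com/Items.aspx?cat=SON")
    i += 1

print(len(urls))       # 20 requests queued...
print(len(set(urls)))  # ...but only 1 distinct URL among them
```

Since the URLs are identical, at most one of these requests would ever be fetched.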


2 Answers

In the parse method, 50 products per page are selected.

page_rs_50 handles the pagination.

# imports needed by this snippet (old-style Scrapy API)
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector

start_urls = ['http://www.e3050.com/Items.aspx?cat=SON']
pro_urls = []  # all product URLs

def parse(self, response): # select 50 products on each page
    yield FormRequest.from_response(response,
        formdata={'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
                  'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$sortddl': 'Price(ASC)'},
        meta={'curr': 1, 'total': 0, 'flag': True},
        dont_click=True,
        callback=self.page_rs_50)

def page_rs_50(self, response): # paginate the pages
    hxs = HtmlXPathSelector(response)
    curr = int(response.request.meta['curr'])
    total = int(response.request.meta['total'])
    flag = response.request.meta['flag']
    self.pro_urls.extend(hxs.select(
        "//td[@class='name']//a[contains(@id,'ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_itemslv_ctrl')]/@href"
    ).extract())
    if flag:
        total = int(hxs.select(
            "//span[@id='ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_pagesizeBtm']/text()").re(r'\d+')[0])
    if curr < total:
        curr += 1
        yield FormRequest.from_response(response,
            formdata={'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
                      'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$sortddl': 'Price(ASC)',
                      'ctl00$ctl00$ScriptManager1': 'ctl00$ctl00$ScriptManager1|ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01',
                      '__EVENTTARGET': 'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01',
                      'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$hfVSFileName': hxs.select(
                          ".//input[@id='ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_hfVSFileName']/@value").extract()[0]},
            meta={'curr': curr, 'total': total, 'flag': False},
            dont_click=True,
            callback=self.page_rs_50
        )
    else:
        for pro in self.pro_urls:
            yield Request("http://www.e3050.com/%s" % pro,
                callback=self.parse_product)


def parse_product(self, response):
    # TODO: implementation required for parsing the product page
    pass
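One fragile spot in the code above: building product links with `"http://www.e3050.com/%s" % pro` breaks if an extracted href is absolute or starts with a slash. The standard library's `urljoin` handles all of these shapes; the sample hrefs below are hypothetical, chosen only to illustrate the three cases:

```python
try:
    from urlparse import urljoin          # Python 2, matching the answer's era
except ImportError:
    from urllib.parse import urljoin      # Python 3

base = "http://www.e3050.com/Items.aspx?cat=SON"

# Hypothetical href shapes a product listing might emit;
# all three resolve to the same absolute product URL.
print(urljoin(base, "Product.aspx?id=7"))                       # relative
print(urljoin(base, "/Product.aspx?id=7"))                      # root-relative
print(urljoin(base, "http://www.e3050.com/Product.aspx?id=7"))  # absolute
```

Each call prints `http://www.e3050.com/Product.aspx?id=7`, so the spider could safely do `Request(urljoin(response.url, pro), ...)` instead of string formatting.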

        # Get last page number
        last_page = hxs.select('//span[@id="ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_PageSize"]/text()').extract()[0]
        i = 1

        # preparing requests for each page
        while i < (int(last_page) / 5) + 1:
            requests.append(Request("http://www.e3050.com/Items.aspx?cat=SON", callback=self.parse_product))
            i += 1

First, instead of these manipulations with i, you can write:

        for i in xrange(1, int(last_page) // 5 + 1):

Then you do:

            requests.append(Request("http://www.e3050.com/Items.aspx?cat=SON", callback=self.parse_product))

You are creating many requests for the exact same URL — is that intended?
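If the listing also accepted the page number as a query parameter — an assumption that would need checking against the real site; the ASP.NET form postback in the accepted answer is the verified route — the loop could emit one distinct URL per page instead:

```python
# Hypothetical: assumes the listing honours a "page" query parameter.
last_page = "100"  # example value read from the page-size label

urls = ["http://www.e3050.com/Items.aspx?cat=SON&page=%s" % i
        for i in range(1, int(last_page) // 5 + 1)]  # xrange on Python 2

print(len(urls))  # 20 distinct page URLs
print(urls[0])    # http://www.e3050.com/Items.aspx?cat=SON&page=1
```

With distinct URLs, Scrapy's default duplicate filter no longer collapses the whole batch into a single request.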

