4

Question on nutch, web-crawler, search-engine – How to save the original HTML file with Apache Nutch

I am crawling sites with Apache Nutch and would like to keep the original HTML of every fetched page as a file on disk, not only inside Nutch's own segment data. How can I make Nutch save the raw HTML?

5 answers
  • 0

    Recent versions of Nutch ship a dump tool that exports the fetched segment data back out as plain files:

    nutch dump
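
    The exact flags vary by Nutch version; on a 1.x install the invocation looks roughly like this (the segment path and output directory are placeholders, not from the answer):

        bin/nutch dump -segment crawl/segments/20230101000000 -outputDir dump/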

  • 6

    You have to modify the Nutch source itself, in Fetcher.java, so that every successfully fetched page also gets written to disk. To get a workspace you can build and patch, follow the guide Nutch in Eclipse.

    In the fetch loop, after a page comes back successfully:

    case ProtocolStatus.SUCCESS:        // got a page
        pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
        updateStatus(content.getContent().length);

        //------------------------------------------- content saver ---------------------------------------------\\
        // needs java.io.File, java.io.FileWriter, java.io.BufferedWriter imports at the top of Fetcher.java
        String filename = "savedsites/" + content.getUrl().replace('/', '-');

        File file = new File(filename);
        file.getParentFile().mkdirs();
        boolean created = file.createNewFile();             // false if the file already exists
        if (!created) {
            System.out.println("File exists.");
        } else {
            FileWriter fstream = new FileWriter(file);
            BufferedWriter out = new BufferedWriter(fstream);
            String page = content.toString();
            int htmlStart = page.indexOf("<!DOCTYPE html");
            out.write(htmlStart >= 0 ? page.substring(htmlStart) : page);  // fall back to the whole dump if no doctype is found
            out.close();
            System.out.println("File created successfully.");
        }
        //------------------------------------------- content saver ---------------------------------------------\\
    

  • 5

    Alternatively, read the fetched content back out of a crawled segment: Nutch stores it in a Hadoop SequenceFile under the segment's content directory.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.nutch.protocol.Content;
        import org.apache.nutch.util.NutchConfiguration;

        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);

        // "segment" is the path of one crawl segment, e.g. crawl/segments/<timestamp>
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);

        try {
            Text key = new Text();              // the page URL
            Content content = new Content();    // headers plus the raw fetched bytes

            while (reader.next(key, content)) {
                System.out.println(new String(content.getContent()));
            }
        } catch (Exception e) {
            e.printStackTrace();                // don't swallow read errors silently
        } finally {
            reader.close();
        }
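
    Since the question is about saving pages rather than printing them, the loop body can just as well write each record to disk. A minimal sketch; the dump directory and the slash-to-dash file naming are my own choices, not part of the answer:

        // replaces the System.out.println(...) in the loop above;
        // needs java.io.File and java.io.FileOutputStream imports
        File outFile = new File("dump", key.toString().replace('/', '-') + ".html");
        outFile.getParentFile().mkdirs();
        try (FileOutputStream fos = new FileOutputStream(outFile)) {
            fos.write(content.getContent());    // the raw fetched bytes, i.e. the original HTML
        }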
    

  • 0

    In Nutch 2.x the fetched bytes live on the GORA WebPage object; inside the fetcher you can pull them out of its ByteBuffer. This variant stashes the HTML in the page's reprUrl field, which is a hack, but it gets the content somewhere you can inspect it:


    // needs java.io.ByteArrayInputStream, java.nio.ByteBuffer, java.util.Scanner imports
    if (content != null) {
        ByteBuffer raw = fit.page.getContent();
        if (raw != null) {
            ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(
                    raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
            Scanner scanner = new Scanner(arrayInputStream);
            scanner.useDelimiter("\\Z");    // read the whole stream as a single token
            String data = "";
            if (scanner.hasNext()) {
                data = scanner.next();
            }
            fit.page.setReprUrl(StringUtil.cleanField(data));   // hack: the HTML ends up in reprUrl
            scanner.close();
        }
    }
    

  • 9

    Nutch keeps whatever it fetches inside binary segment data, not as standalone HTML files, so first ask whether you need Nutch for this at all.

    If the list of pages/URLs you intend to fetch is quite small, you are better off with a script that invokes wget for each URL (a sketch follows below), or with the HTTrack tool.
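
    As a rough sketch of the wget route (urls.txt with one URL per line is an assumption):

        while read -r url; do
            wget "$url"        # consider adding -E -p for page requisites with sensible extensions
        done < urls.txt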

    EDIT:

    A few pointers to help with this. The place where a fetched page becomes available is this part of Fetcher.java:

    pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
    updateStatus(content.getContent().length);
    

    Here content holds the fetched page: content.getContent() returns the raw bytes, and fit.url is the URL they came from, so this is the point where you can write the HTML wherever you want.
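
    Tying those pieces together, a minimal sketch of a save hook placed right after updateStatus(...); the savedsites directory mirrors the earlier answer and is an assumption, not something this answer prescribes:

        // sketch only; needs java.io.File, java.io.FileOutputStream, java.io.IOException imports
        File saved = new File("savedsites", fit.url.toString().replace('/', '-'));
        saved.getParentFile().mkdirs();
        try (FileOutputStream fos = new FileOutputStream(saved)) {
            fos.write(content.getContent());    // the raw page bytes
        } catch (IOException e) {
            e.printStackTrace();                // a failed save should not kill the fetcher thread
        }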