Question on search-engine, web-crawler, nutch – How to save the original HTML file with Apache Nutch

4


5 answers
0

nutch dump.

9

If the list of pages/URLs you need is small, it is easier to write a script that invokes wget for each URL, or to use the HTTrack tool.
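The wget-per-URL approach can be sketched as follows. Java is used only to match the other snippets on this page; a shell loop would do just as well. The class name WgetEach, the helper buildCommand and the output directory "pages" are made up for illustration, and the sketch assumes wget is installed and on the PATH.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper: runs wget once for every URL given on the command line.
class WgetEach {

    // Builds the wget invocation for one URL.
    // -q silences progress output, -P sets the directory the file is saved into.
    static List<String> buildCommand(String url, String outDir) {
        return List.of("wget", "-q", "-P", outDir, url);
    }

    // Runs wget for one URL and returns its exit code (0 means success).
    static int fetch(String url, String outDir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(url, outDir))
                .inheritIO()   // show wget's own error messages on the console
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        for (String url : args) {
            int code = fetch(url, "pages");
            System.out.println(url + " -> exit code " + code);
        }
    }
}
```

Run it as java WgetEach.java followed by the URLs; each page is saved under the pages directory with the name wget chooses.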

EDIT:

A few pointers to help you get started:

pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
updateStatus(content.getContent().length);

The fetched page is held in content, its raw bytes are returned by content.getContent(), and its URL is fit.url.
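Once the raw bytes and the URL are in hand, writing the page to disk needs only the standard library. Below is a minimal sketch that does not depend on Nutch: the class PageSaver, its methods and the file-naming rule are hypothetical. In the fetcher, the arguments would be the bytes from content.getContent() and the URL from fit.url as a string.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch.
class PageSaver {

    // Turns a URL into a file name by replacing every character
    // other than letters, digits, dot and hyphen with an underscore.
    static String toFileName(String url) {
        return url.replaceAll("[^A-Za-z0-9.-]", "_") + ".html";
    }

    // Writes the bytes unchanged, so the saved file matches what was fetched.
    static Path save(Path dir, String url, byte[] rawContent) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(toFileName(url));
        return Files.write(target, rawContent);
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<!DOCTYPE html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        Path dir = Files.createTempDirectory("savedsites");
        Path saved = save(dir, "http://example.com/a?b=1", html);
        System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
    }
}
```

Writing bytes instead of a decoded string avoids guessing the page's character encoding at save time.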

5

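    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // "segment" is the path of a fetched segment directory
    Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);

    try
    {
            Text key = new Text();              // URL of the page
            Content content = new Content();    // fetched page

            while (reader.next(key, content))
            {
                    // getContent() returns the raw bytes of the page
                    System.out.println(new String(content.getContent()));
            }
    }
    catch (Exception e)
    {
            e.printStackTrace();    // do not swallow read errors silently
    }
    finally
    {
            IOUtils.closeStream(reader);
    }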
0


if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        // Wrap only the readable part of the buffer
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        Scanner scanner = new Scanner(arrayInputStream);
        scanner.useDelimiter("\\Z"); // read the whole stream as a single token
        String data = "";
        if (scanner.hasNext()) {
            data = scanner.next();
        }
        // Store the cleaned page text in the page's reprUrl field
        fit.page.setReprUrl(StringUtil.cleanField(data));
        scanner.close();
    }
}
6

Nutch in Eclipse.

case ProtocolStatus.SUCCESS:        // got a page
            pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
            updateStatus(content.getContent().length);


            //------------------------------------------- content saver ---------------------------------------------\\
            // Needs java.io.File, java.io.FileWriter and java.io.BufferedWriter
            String filename = "savedsites/" + content.getUrl().replace('/', '-');

            File file = new File(filename);
            file.getParentFile().mkdirs();
            boolean created = file.createNewFile();   // false if the file already exists
            if (!created) {
                System.out.println("File exists.");
            } else {
                String text = content.toString();
                // Skip the header fields that Content.toString() prints before the markup
                int start = text.indexOf("<!DOCTYPE html");
                BufferedWriter out = new BufferedWriter(new FileWriter(file));
                // If the page has no such doctype, write the whole text instead of failing
                out.write(start >= 0 ? text.substring(start) : text);
                out.close();
                System.out.println("File created successfully.");
            }
            //------------------------------------------- content saver ---------------------------------------------\\
