Is there a way to use the Ingest Attachment plugin with Elastic App Search?

Question

Answer 1

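The answer is to either use an attachment pipeline to extract the attachment content, as suggested in this blog post, or, if your backend uses Java like mine does, use Apache Tika to extract the content from the attachments yourself.

I implemented Tika to extract HTML content (it is actually quite simple):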

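// Extracts the body text from an HTML string using Apache Tika's HtmlParser.
static String getContent(String htmlContent) throws TikaException, SAXException, IOException {
    // Use an explicit charset (java.nio.charset.StandardCharsets) rather than the platform default.
    InputStream input = new ByteArrayInputStream(htmlContent.getBytes(StandardCharsets.UTF_8));
    // Collects only the text inside <body>; the default constructor caps output at 100,000 characters.
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    new HtmlParser().parse(input, handler, metadata, new ParseContext());
    return handler.toString();
}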

For PDF files, I was already using Apache PDFBox to extract some other properties, so the text comes "for free". The same goes for Office files, but that requires Apache POI.
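
For the attachment-pipeline route mentioned at the start of this answer, here is a minimal sketch of the request bodies involved, using only the JDK. It assumes a pipeline id of "attachment" and a source field named "data"; both names, and the class name, are illustrative and not taken from the blog post. The sketch only builds the JSON; sending it to your cluster is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch only: builds JSON bodies for the ingest attachment processor.
// The pipeline id "attachment" and the field name "data" are assumptions.
final class AttachmentPipelineSketch {

    // Body for: PUT _ingest/pipeline/attachment
    static String pipelineDefinition() {
        return "{\"description\":\"Extract attachment text\","
             + "\"processors\":[{\"attachment\":{\"field\":\"data\"}}]}";
    }

    // The attachment processor expects the file as a Base64-encoded string.
    static String encode(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    // Body for: POST _ingest/pipeline/attachment/_simulate
    // Base64 output contains no characters that need JSON escaping.
    static String simulateBody(byte[] fileBytes) {
        return "{\"docs\":[{\"_source\":{\"data\":\"" + encode(fileBytes) + "\"}}]}";
    }

    public static void main(String[] args) {
        byte[] html = "<p>hello</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(pipelineDefinition());
        System.out.println(simulateBody(html));
    }
}
```

The simulate response should carry the extracted text under attachment.content in each returned document, which a backend could then copy into a field of the document it sends to App Search. This is one possible arrangement, not necessarily the one the blog post describes.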
