I’m posting this Scala shell script as a “source code snippet” so I can find it again if I need it. The script reads an input file that contains a series of HTML <h1> tags. I use it as part of the process of publishing an Amazon Kindle ebook from an HTML file: in one of the steps, it helps create the Table of Contents (TOC) for the book.
Here’s the source code:
#!/bin/sh
exec scala -classpath ".:lib/htmlcleaner-2.2.jar:lib/commons-lang3-3.1.jar" -savecompiled "$0" "$@"
!#
import org.htmlcleaner.HtmlCleaner
import org.apache.commons.lang3.StringEscapeUtils
val INPUT_FILE = "h1tags.html"
// Read the whole file into memory, closing the source even if reading fails.
def readFile(filename: String): Seq[String] = {
    val bufferedSource = io.Source.fromFile(filename)
    try bufferedSource.getLines().toList
    finally bufferedSource.close()
}
val lines = readFile(INPUT_FILE)
val html = lines.mkString("\n")
val cleaner = new HtmlCleaner
val rootNode = cleaner.clean(html)
val h1tags = rootNode.getElementsByName("h1", true)
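// Keep only the <h1> tags that have an id, pairing each id with its unescaped text.
val h1Seq = for {
    e <- h1tags
    idAttr = e.getAttributeByName("id")
    if idAttr != null
    idText = StringEscapeUtils.unescapeHtml4(e.getText.toString.trim)
} yield (idAttr, idText)
// Print one TOC list item per heading.
h1Seq.foreach { case (id, text) =>
    println(s"""<li><a href="#$id">$text</a><br></li>""")
}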
FWIW, the file that contains the <h1> tags has entries that look like this:
<h1 id="copyright" class="unnumbered">Copyright</h1>
<h1 id="introduction-or-why-i-wrote-this-book">Introduction(or, Why I Wrote This Book)</h1>
<h1 id="who-this-book-is-for">Who This Book is For</h1>
Knowing that the file h1tags.html contains only <h1> tags, here’s how the Scala script works: it reads the whole file into one string, parses it with HtmlCleaner, and then loops over the <h1> elements, extracting the id attribute and the heading text from each one. Any <h1> tag without an id is skipped. It then prints an HTML <li> tag for each remaining <h1> tag, and that output is used by another shell script that wraps this one to create the TOC.
As a final note, the script uses the HtmlCleaner and Apache Commons Lang libraries, as implied by the classpath entry at the top of the script, which expects both jar files in a lib directory under the current directory.
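To make the <h1>-to-<li> mapping concrete, here is a minimal, dependency-free sketch of the same idea. It is not the script above: a regular expression stands in for HtmlCleaner, it assumes each <h1> tag sits on one line with a double-quoted id, and it does not unescape HTML entities. The names H1ToToc and tocItems are hypothetical, chosen only for this illustration.

```scala
// Hypothetical sketch: a regex replaces HtmlCleaner, so this only handles
// simple, single-line <h1> tags with a double-quoted id attribute.
object H1ToToc {
  // Captures the id value and the heading text; <h1> tags without an id do not match.
  private val H1WithId = """<h1\b[^>]*\sid="([^"]+)"[^>]*>(.*?)</h1>""".r

  // Turns every matching <h1> in the input into one TOC list item.
  def tocItems(html: String): Seq[String] =
    H1WithId.findAllMatchIn(html).map { m =>
      val id = m.group(1)
      val text = m.group(2).trim
      s"""<li><a href="#$id">$text</a><br></li>"""
    }.toList

  def main(args: Array[String]): Unit = {
    val sample =
      """<h1 id="copyright" class="unnumbered">Copyright</h1>
        |<h1 id="who-this-book-is-for">Who This Book is For</h1>
        |<h1>No id, so this one is skipped</h1>""".stripMargin
    tocItems(sample).foreach(println)
  }
}
```

Running main prints one <li> line for each of the first two sample headings and nothing for the third. For real-world HTML, a parser such as HtmlCleaner is the safer choice, since a regex like this breaks on headings that span lines or contain nested markup.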