Open
Conversation
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #26, closes #25, closes #14.
To build, a JDK 11+ is required. To build the Javadoc, a JDK 15+ is required (otherwise it fails on the
xommodule. There's already an RC available at http://jdk.java.net/15/, and the final release is due next month).The commits should be self-explanatory.
W.r.t. the cleanup: which of these commits must be reverted, and what can effectively be removed (also considering that it's trivial to resurrect anything from a Git repo)?
Edit: to clarify: before starting out with this PR, I deleted everything that isn't required to build the final JARs, purely for my own convenience. Obviously, I'm aware that some of it needs to stay. However, given the reorganization of the project's structure, I believe it would be useful to go over these commits, and decide what to do with each of them:
gwt-srcseems to have been added at the time in order to enable HTML parsing in env.js. However,env.jsis effectively dead, and GWT itself is mostly irrelevant as well nowadays)htmlparser,saxtree,xom). So to avoid clutter in the top-level directory, I believe it would be best to move any other remaining top-level directories elsewhere, typically into the newhtmlparsermodule's directory (e.g.translator-srcwould move tohtmlparser/src/translator).So for the record: I'm not putting any of this up for debate: if you say "revert all commits as is", then I'll do so without further ado.
W.r.t. "Make jchardet & ICU4J heuristics a no-op":
jchardetis problematic. When relying on its automatic module name, Maven gives the following warning:So I really believe we should eliminate the dependency altogether. The easiest way to do so, is by simply making the related heuristic a no-op. In fact, I believe the same should be done with the
ICU4Jdependency. The usage of optional Maven dependencies andrequires staticclauses are indications that the current situation isn't quite right.The current commit is the minimal change required to eliminate the dependencies.
Now there are 3 options:
remove
Heuristicsaltogether, given thatInputSourcedoesn't wrap aReader, and both of the basic sniffers (Bom&Meta) failuse an SPI:
htmlparsershould introduce an interfaceEncodingSnifferand itsmodule-info.javashould sayuses EncodingSniffer;. Then the current sniffers should be moved into their own Maven module(s), implementEncodingSniffer, and be made available as services tohtmlparsermerge as is: this is no more than a behavioral regression (except for the removal of the sniffers, but I don't see how direct usage of these classes could possibly be justified), so we could just wait for someone to report it. Then once (if ever) it happens, one of the former 2 options can be implemented
What's your opinion?
(Edit: @carlosame I edited this post to clarify the "Clean up"-commits. And contrary to what you're saying, normalization checking no longer requires ICU4J. So please remove any comments that are not/no longer relevant, or at least bundle all your comments into a single one.)
(Edit2: @carlosame ICU4J is over 30x the size of htmlparser, and both its encoding detection (which "isn't very good") and normalization checking are disabled by default. Anyway, I get your point. And again: 3 of your comments are irrelevant at this point, and the other 2 could easily be merged into 1. So please clean up your existing comments & stop adding new comments (just to be clear: add as much feedback as you want, but do so by editing an existing comment, not adding new ones all the time).)