Using a simple regex and String.replaceAll() to remove HTML from a String:
We can use a simple Java regex to remove all HTML tags from a string.
The drawback is that it is difficult to remove HTML tags selectively. For example, if we want to remove all HTML tags except <span> and <br>, the regex becomes more complex.
package net.codermag.example;

public class ConvertHTML {
    public static void main(String[] args) {
        String text = "<div><span><b style='color:blue;'>CoderMagnet:</b>The Developer playground.</span></div>";
        // Match "<", then any run of characters other than ">", then ">", and delete it
        System.out.println(text.replaceAll("<[^>]*>", ""));
    }
}
Output:
CoderMagnet:The Developer playground.
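The selective removal mentioned earlier can be sketched with a negative lookahead. This is a minimal sketch, not a definitive implementation: the class name SelectiveStrip and the helper stripExceptSpanAndBr are hypothetical names introduced for illustration, and the pattern assumes lowercase tag names.

```java
public class SelectiveStrip {
    // Hypothetical helper: removes every tag except <span> and <br>
    // (opening, closing, or self-closing forms).
    static String stripExceptSpanAndBr(String html) {
        // (?!/?(?:span|br)\b) refuses to match when the tag name is span or br;
        // \b stops names that merely start with "br" or "span" from being kept.
        return html.replaceAll("<(?!/?(?:span|br)\\b)[^>]*>", "");
    }

    public static void main(String[] args) {
        String text = "<div><span><b style='color:blue;'>CoderMagnet:</b><br/>The Developer playground.</span></div>";
        // prints: <span>CoderMagnet:<br/>The Developer playground.</span>
        System.out.println(stripExceptSpanAndBr(text));
    }
}
```

Every additional tag to keep makes the alternation inside the lookahead longer, which is the added complexity described above. Prefixing the pattern with (?i) would also keep uppercase tags such as <SPAN>.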
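Pitfall: stripping every tag also discards the line breaks that <br> and <p> represent, so replace those tags with newlines before removing the rest: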
package net.codermag.example;

public class ConvertHTML {
    public static void main(String[] args) {
        String text = "<div><span><br>CoderMagnet:<br/>The <p>Developer</p> playground.</span></div>";
        // Replacing the <br> and <p> with newlines
        text = text.replaceAll("<br>", "\n").replaceAll("<br/>", "\n").replaceAll("</br>", "\n");
        text = text.replaceAll("<p>", "\n");
        // Removing all remaining tags
        text = text.replaceAll("<[^>]*>", "");
        System.out.println(text);
    }
}
Output (preceded by a blank line, because the leading <br> also becomes a newline):
CoderMagnet:
The
Developer playground.
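Note that this regex does not actually parse HTML, so malformed or unusual markup can produce wrong results. For example, a > inside an attribute value ends the match early, and a bare < in ordinary text swallows everything up to the next >. Test the pattern against the kind of input you expect to receive.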
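To make the risk concrete, here is a small sketch of inputs on which the tag-stripping regex misbehaves. The class name StripPitfalls and the helper stripTags are hypothetical names used only for this illustration; the pattern is the same one used in the examples above.

```java
public class StripPitfalls {
    // Same tag-stripping pattern as in the examples above.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A ">" inside an attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        System.out.println(stripTags("<a title=\"x > y\">link</a>"));

        // Plain text containing "<" and ">" is mistaken for a tag
        // and the text between them is deleted.
        System.out.println(stripTags("1 < 2 and 3 > 2"));

        // A tag that is never closed is left in the output untouched.
        System.out.println(stripTags("<b>bold</b> <i text"));
    }
}
```

If input like this is likely, a real HTML parser is usually a safer choice than a regex.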