Class CharSetUtils

java.lang.Object
org.tquadrat.foundation.util.CharSetUtils

@ClassVersion(sourceVersion="$Id: CharSetUtils.java 1060 2023-09-24 19:21:40Z tquadrat $") @API(status=STABLE, since="0.1.0") @UtilityClass public final class CharSetUtils extends Object
This class provides several utilities dealing with Strings in different character sets/encodings.
Author:
Thomas Thrien (thomas.thrien@tquadrat.org)
Version:
CharSetUtils: HexUtils.java 747 2020-12-01 12:40:38Z tquadrat $
Since:
0.1.0
UML Diagram
UML Diagram for "org.tquadrat.foundation.util.CharSetUtils"

  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    private
    CharSetUtils()
    No instance allowed for this class!
  • Method Summary

    Modifier and Type
    Method
    Description
    static final String
    convertBytesToASCII(byte[] bytes)
    Converts the given byte array into a String that will only contain printable ASCII characters; all other characters will be 'escaped' to the format "\uXXXX".
    static final String
    Converts a String that contains only ASCII characters and Unicode escape sequences like "\uXXXX" to the equivalent Unicode String.

    This method will not touch other escape sequences, like "\n" or "\t".
    static final String
    convertUnicodeToASCII(CharSequence input)
    Translates the given Unicode String without any normalisation to a String that will only contain printable ASCII characters; all other characters will be 'escaped' to the format "\uXXXX".
    static final String
    convertUnicodeToASCII(Normalizer.Form normalization, CharSequence input)
    Applies the given normalisation to the given Unicode String and translates it to a String that will only contain printable ASCII characters; all other characters will be 'escaped' to the format "\uXXXX".
    static final String
    escapeCharacter(char c)
    Returns the Unicode escape sequence for the given character.
    static final String
    escapeCharacter(int codePoint)
    Returns the Unicode escape sequence for the given code point.
    private static final int
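    extractEscapeSequence(StringBuilder buffer, Pattern pattern, CharSequence chunk)
    Extracts the escape sequence from the given chunk, writes the result to the buffer and returns the offset.
    static final boolean
    isASCIICharacter(char c)
    Returns true if the given character is an ASCII character.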
    static final boolean
    isASCIICharacter(int codePoint)
    Returns true if the given code point represents an ASCII character.
    static final boolean
    isPrintableASCIICharacter(char c)
    Returns true if the given character is a printable ASCII character.
    static final boolean
    isPrintableASCIICharacter(int codePoint)
    Returns true if the given code point represents a printable ASCII character.
    static final String
    unescapeUnicode(CharSequence input)
    Parses Strings in the format "\uXXXX", containing the textual representation of a single Unicode character, to the respective Unicode character.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • CharSetUtils

      private CharSetUtils()
      No instance allowed for this class!
  • Method Details

    • convertBytesToASCII

      @API(status=STABLE, since="0.1.0") public static final String convertBytesToASCII(byte[] bytes)
      Converts the given byte array into a String that will only contain printable ASCII characters; all other characters will be 'escaped' to the format "\uXXXX". This can be useful to transfer a String in a character set/encoding other than ASCII or UTF-8/Unicode, provided that the receiving party can interpret the format.

      In general, however, a transfer encoding like BASE64 or quoted-printable should be preferred.
      Parameters:
      bytes - The input; may be null.
      Returns:
      The output string; null if the input was already null.
      Since:
      0.1.0
    • convertEscapedStringToUnicode

      Converts a String that contains only ASCII characters and Unicode escape sequences like "\uXXXX" to the equivalent Unicode String.

      This method will not touch other escape sequences, like "\n" or "\t". Refer to String.translateEscapes().
      Parameters:
      input - The input String; may be null.
      Returns:
      The output string; null if the input string was already null.
      Throws:
      IllegalArgumentException - The given input String contained at least one non-ASCII character.
      Since:
      0.1.0
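
The documented behaviour can be approximated with the JDK alone. The following is a minimal sketch, not the library's implementation: the class name UnescapeSketch and the method unescape() are hypothetical, and the parameter type CharSequence is an assumption. It replaces each escape sequence (backslash, 'u', four hex digits) by the corresponding char, leaves sequences like "\n" untouched, and rejects non-ASCII input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class UnescapeSketch
{
    // One escape sequence: backslash, 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile( "\\\\u([0-9a-fA-F]{4})" );

    /** Hypothetical stand-in for the documented method; assumes a CharSequence argument. */
    public static String unescape( final CharSequence input )
    {
        if( input == null ) return null;

        // Mirror the documented contract: non-ASCII input is rejected.
        for( int i = 0; i < input.length(); ++i )
        {
            if( input.charAt( i ) > 0x7F )
            {
                throw new IllegalArgumentException( "Non-ASCII character at index " + i );
            }
        }

        final Matcher matcher = ESCAPE.matcher( input );
        final StringBuilder buffer = new StringBuilder();
        int last = 0;
        while( matcher.find() )
        {
            // Copy the text before the escape unchanged, then the decoded char.
            buffer.append( input, last, matcher.start() );
            buffer.append( (char) Integer.parseInt( matcher.group( 1 ), 16 ) );
            last = matcher.end();
        }
        buffer.append( input, last, input.length() );
        return buffer.toString();
    }

    public static void main( final String... args )
    {
        // The trailing backslash-n is not a Unicode escape and stays as two characters.
        System.out.println( unescape( "Gr\\u00fc\\u00dfe\\n" ) );
    }
}
```

Two consecutive escapes that encode a surrogate pair are decoded char by char in this sketch, so they end up as a valid pair in the result.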
    • convertUnicodeToASCII

      @API(status=STABLE, since="0.1.0") public static final String convertUnicodeToASCII(Normalizer.Form normalization, CharSequence input)
      Applies the given normalisation to the given Unicode String and translates it to a String that will only contain printable ASCII characters; all other characters will be 'escaped' to the format "\uXXXX".
      Parameters:
      normalization - The normalisation form; in case it is null, no normalisation will be performed.
      input - The input String; may be null.
      Returns:
      The output String; null if the input String was already null.
      Since:
      0.1.0
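
The effect of the normalisation argument is easiest to see on a small example. The sketch below is a JDK-only approximation under stated assumptions, not the library's code: the names ToAsciiSketch and toAscii() are hypothetical, "printable ASCII" is taken to mean the range 0x20 to 0x7E, and escaping is done per UTF-16 char, so a supplementary character would appear as two escape sequences.

```java
import java.text.Normalizer;

public final class ToAsciiSketch
{
    /** Hypothetical stand-in: normalise (optionally), then escape everything outside 0x20..0x7E. */
    public static String toAscii( final Normalizer.Form form, final CharSequence input )
    {
        if( input == null ) return null;

        // A null form means "no normalisation", as in the documented contract.
        final String text = form == null ? input.toString() : Normalizer.normalize( input, form );

        final StringBuilder buffer = new StringBuilder();
        for( int i = 0; i < text.length(); ++i )
        {
            final char c = text.charAt( i );
            if( c >= 0x20 && c < 0x7F )
            {
                buffer.append( c );
            }
            else
            {
                // Four lowercase hex digits, preceded by backslash and 'u'.
                buffer.append( String.format( "\\u%04x", (int) c ) );
            }
        }
        return buffer.toString();
    }

    public static void main( final String... args )
    {
        final String word = "caf" + (char) 0xE9; // 'e' with acute accent, precomposed
        System.out.println( toAscii( null, word ) );
        System.out.println( toAscii( Normalizer.Form.NFD, word ) );
    }
}
```

Without normalisation, the precomposed character is escaped as a single sequence (code point 00e9). With Normalizer.Form.NFD, it is first decomposed into a plain 'e' followed by the combining acute accent, so only the accent (code point 0301) is escaped.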
    • convertUnicodeToASCII

      @API(status=STABLE, since="0.1.0") public static final String convertUnicodeToASCII(CharSequence input)
      Translates the given Unicode String without any normalisation to a String that will only contain printable ASCII characters; all other characters will be 'escaped' to the format "\uXXXX". Calling this method is the same as calling convertUnicodeToASCII(Normalizer.Form, CharSequence) with null as the first argument.
      Parameters:
      input - The input String; may be null.
      Returns:
      The output String; null if the input String was already null.
      Since:
      0.1.0
    • escapeCharacter

      @API(status=STABLE, since="0.1.0") public static final String escapeCharacter(char c)
      Returns the Unicode escape sequence for the given character. This will return "\u0075" for the letter 'u', and "\u003c" for the less-than sign '<'.

      This method should be used only for characters that are not surrogates; for general use, the implementation that takes a code point is preferred.
      Parameters:
      c - The character.
      Returns:
      The escape sequence.
      Since:
      0.1.0
    • escapeCharacter

      @API(status=STABLE, since="0.1.0") public static final String escapeCharacter(int codePoint) throws IllegalArgumentException
      Returns the Unicode escape sequence for the given code point. This will return "\u0075" for the letter 'u', and "\u003c" for the less-than sign '<'.

      This method takes only a single code point; to translate a whole String, this code sequence can be used:
        …
        String result = input.codePoints()
            .mapToObj( codePoint -> escapeCharacter( codePoint ) )
            .collect( Collectors.joining() );
        …
      This will escape all characters in the String. If only a subset needs to be escaped, the mapping function in mapToObj() can be adjusted accordingly. The method convertUnicodeToASCII(CharSequence) implements something along these lines.
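
For experimenting without the library on the class path, the escaping step can be sketched with the JDK alone. The names EscapeSketch, escape() and escapeAll() are hypothetical. The handling of supplementary code points (one escape sequence per surrogate) is an assumption, suggested only by the offsets 6 and 12 documented for extractEscapeSequence().

```java
import java.util.stream.Collectors;

public final class EscapeSketch
{
    /** Hypothetical stand-in: one escape sequence per UTF-16 char of the code point. */
    public static String escape( final int codePoint )
    {
        if( !Character.isValidCodePoint( codePoint ) )
        {
            throw new IllegalArgumentException( "Invalid code point: " + codePoint );
        }

        final StringBuilder buffer = new StringBuilder();
        // toChars() yields one char for the BMP and a surrogate pair otherwise.
        for( final char c : Character.toChars( codePoint ) )
        {
            buffer.append( String.format( "\\u%04x", (int) c ) );
        }
        return buffer.toString();
    }

    /** Escapes every code point of the input, following the code sequence shown above. */
    public static String escapeAll( final String input )
    {
        return input.codePoints()
            .mapToObj( EscapeSketch::escape )
            .collect( Collectors.joining() );
    }

    public static void main( final String... args )
    {
        System.out.println( escape( 'u' ) );
        System.out.println( escape( 0x1F600 ) ); // supplementary code point, two sequences
        System.out.println( escapeAll( "u<" ) );
    }
}
```

To escape only part of a String, the method reference in escapeAll() could be replaced by a lambda that returns Character.toString( codePoint ) for the code points that should pass through unchanged.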
      Parameters:
      codePoint - The code point.
      Returns:
      The escape sequence.
      Throws:
      IllegalArgumentException - The given code point is invalid.
      Since:
      0.1.0
    • isASCIICharacter

      public static final boolean isASCIICharacter(char c)
      Returns true if the given character is an ASCII character.
      Parameters:
      c - The character to check.
      Returns:
      true if the given character is an ASCII character, false otherwise.
    • isASCIICharacter

      public static final boolean isASCIICharacter(int codePoint)
      Returns true if the given code point represents an ASCII character.
      Parameters:
      codePoint - The code point to check.
      Returns:
      true if the given code point represents an ASCII character, false otherwise.
    • isPrintableASCIICharacter

      public static final boolean isPrintableASCIICharacter(char c)
      Returns true if the given character is a printable ASCII character. That means, it is an ASCII character, but not a control character.
      Parameters:
      c - The character to check.
      Returns:
      true if the given character is a printable ASCII character, false otherwise.
    • isPrintableASCIICharacter

      public static final boolean isPrintableASCIICharacter(int codePoint)
      Returns true if the given code point represents a printable ASCII character. That means, it is an ASCII character, but not a control character.
      Parameters:
      codePoint - The code point to check.
      Returns:
      true if the given code point represents a printable ASCII character, false otherwise.
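
The distinction between "ASCII" and "printable ASCII" can be illustrated with a short JDK-only sketch. The names AsciiCheckSketch, isAscii() and isPrintableAscii() are hypothetical, and the exact boundaries used by the library are an assumption here: ASCII is taken as 0x00 to 0x7F, and "control character" as defined by Character.isISOControl(), which leaves 0x20 (space) to 0x7E as printable.

```java
public final class AsciiCheckSketch
{
    /** Assumed definition: ASCII covers the code points 0x00 to 0x7F. */
    public static boolean isAscii( final int codePoint )
    {
        return codePoint >= 0 && codePoint <= 0x7F;
    }

    /** Assumed definition: ASCII, but not an ISO control character (0x00..0x1F, 0x7F). */
    public static boolean isPrintableAscii( final int codePoint )
    {
        return isAscii( codePoint ) && !Character.isISOControl( codePoint );
    }

    public static void main( final String... args )
    {
        System.out.println( isPrintableAscii( 'A' ) );  // letter: ASCII and printable
        System.out.println( isPrintableAscii( '\n' ) ); // line feed: ASCII, but a control character
        System.out.println( isAscii( 0xE9 ) );          // beyond 0x7F: not ASCII at all
    }
}
```

Under these assumptions the space character counts as printable, while DEL (0x7F) is ASCII but not printable.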
    • extractEscapeSequence

      private static final int extractEscapeSequence(StringBuilder buffer, Pattern pattern, CharSequence chunk)
      Extracts the escape sequence from the given chunk, writes the result to the buffer and returns the offset.
      Parameters:
      buffer - The target buffer.
      pattern - The regex pattern for the check.
      chunk - The chunk to check.
      Returns:
      The offset; one of 1, 6, or 12.
    • unescapeUnicode

      @API(status=STABLE, since="0.1.5") public static final String unescapeUnicode(CharSequence input)
      Parses Strings in the format "\uXXXX", containing the textual representation of a single Unicode character, to the respective Unicode character. Some Unicode characters will be represented as surrogate pairs in Java, so the String that is returned by this method may contain more than one char.

      The input format for this method is used in Java source code Strings, in Java .properties files, in C/C++ source code, in JavaScript source, …
      Parameters:
      input - The input String with the Unicode escape sequence.
      Returns:
      The Unicode character.
      Throws:
      ValidationException - The input is null, empty, or cannot be parsed as a Unicode escape sequence.
      Since:
      0.1.5