Parsing, Tokenizing, and Formatting


  • regex is short for regular expressions, which are the patterns used to search for data within large data sources.
  • regex is a sub-language that exists in Java and other languages (such as Perl).
  • regex lets you create search patterns using literal characters or metacharacters. Metacharacters allow you to search for slightly more abstract data like "digits" or "whitespace". Study the \d, \s, \w, and . metacharacters.
  • regex provides quantifiers, which allow you to specify concepts like "look for one or more digits in a row."
  • Study the ?, *, and + greedy quantifiers.
  • Remember that metacharacters and Strings don't mix well unless you "escape" them properly. In a Java String literal the backslash itself must be escaped, so the regex \d is written as String s = "\\d";
  • The Pattern and Matcher classes have Java's most powerful regex capabilities.
  • You should understand the Pattern compile() method and the Matcher matches(), pattern(), find(), start(), and group() methods.
  • You WON'T need to understand Matcher's replacement-oriented methods.
  • You can use java.util.Scanner to do simple regex searches, but it is primarily intended for tokenizing.
  • Tokenizing is the process of splitting delimited data into small pieces.
  • In tokenizing, the data you want is called tokens, and the strings that separate the tokens are called delimiters.
  • Tokenizing can be done with the Scanner class, or with String.split().
  • Delimiters can be single characters such as commas, or complex regular expressions.
  • The Scanner class allows you to tokenize data from within a loop, which allows you to stop whenever you want to.
  • The Scanner class allows you to tokenize Strings or streams or files.
  • The String.split() method tokenizes the entire source data all at once, so large amounts of data can be quite slow to process.
  • New to Java 5 are two methods used to format data for output. These methods are format() and printf(). These methods are found in the PrintStream class, an instance of which is the out in System.out.
  • The format() and printf() methods have identical functionality.
  • Formatting data with printf() (or format()) is accomplished using formatting strings that are associated with primitive or string arguments.
  • The format() method allows you to mix literals in with your format strings.
  • The format string values you should know are:
      • Flags: -, +, 0, ",", and (
      • Conversions: b, c, d, f, and s
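  • If your conversion character doesn't match your argument type (for instance, %d with a floating-point argument), an exception will be thrown at runtime. The b and s conversions are the lenient ones: they accept an argument of any type.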
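
The points above can be tried out in a small sketch. The class name ParseTokenizeFormat and its method names are hypothetical, chosen only for illustration. The formatting method uses String.format() with Locale.US rather than printf(): it accepts the same format strings, returns the result instead of printing it, and the fixed locale keeps the "," flag and the decimal point from varying with the machine's default locale.

```java
import java.util.ArrayList;
import java.util.IllegalFormatConversionException;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from the summary above.
class ParseTokenizeFormat {

    // Regex search: records "start:group" for each run of one or more digits.
    static List<String> findDigitRuns(String source) {
        Pattern p = Pattern.compile("\\d+");   // "\\d" in Java source is the regex \d
        Matcher m = p.matcher(source);
        List<String> hits = new ArrayList<>();
        while (m.find()) {                     // find() advances to the next match
            hits.add(m.start() + ":" + m.group());
        }
        return hits;
    }

    // Tokenizing in a loop with Scanner, stopping early after `limit` tokens.
    static List<String> firstTokens(String source, String delimiterRegex, int limit) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(source)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext() && tokens.size() < limit) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    // Tokenizing the entire source at once with String.split().
    static String[] allTokens(String source, String delimiterRegex) {
        return source.split(delimiterRegex);
    }

    // One sample of each flag (-, +, 0, ",", "(") and conversion (b, c, d, f, s).
    static String formatSamples() {
        return String.format(Locale.US, "%-6d|%+d|%05d|%,d|%(d|%b|%c|%.2f|%s",
                42, 42, 42, 1234567, -42, true, 'x', 3.14159, "ok");
    }

    // %d with a double argument is a conversion/argument mismatch.
    static boolean mismatchThrows() {
        try {
            String.format("%d", 12.5);
            return false;
        } catch (IllegalFormatConversionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(findDigitRuns("ab12c345"));               // [2:12, 5:345]
        System.out.println(firstTokens("a, b,c ,d", "\\s*,\\s*", 2)); // [a, b]
        System.out.println(allTokens("a, b,c ,d", "\\s*,\\s*").length); // 4
        System.out.println(formatSamples());
        System.out.println(mismatchThrows());                        // true
    }
}
```

Note that the delimiter passed to both Scanner and split() is itself a regex, so the same escaping rule for backslashes applies there.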