Simple XML Subset Parser

Simple XML Subset Parser — parses a subset of XML

Synopsis

#include <glib.h>

enum                GMarkupError;
#define             G_MARKUP_ERROR
enum                GMarkupParseFlags;
                    GMarkupParseContext;
                    GMarkupParser;
gchar*              g_markup_escape_text                (const gchar *text,
                                                         gssize length);
gchar *             g_markup_printf_escaped             (const char *format,
                                                         ...);
gchar *             g_markup_vprintf_escaped            (const char *format,
                                                         va_list args);
gboolean            g_markup_parse_context_end_parse    (GMarkupParseContext *context,
                                                         GError **error);
void                g_markup_parse_context_free         (GMarkupParseContext *context);
void                g_markup_parse_context_get_position (GMarkupParseContext *context,
                                                         gint *line_number,
                                                         gint *char_number);
const gchar *       g_markup_parse_context_get_element  (GMarkupParseContext *context);
const GSList *      g_markup_parse_context_get_element_stack
                                                        (GMarkupParseContext *context);
gpointer            g_markup_parse_context_get_user_data
                                                        (GMarkupParseContext *context);
GMarkupParseContext * g_markup_parse_context_new        (const GMarkupParser *parser,
                                                         GMarkupParseFlags flags,
                                                         gpointer user_data,
                                                         GDestroyNotify user_data_dnotify);
gboolean            g_markup_parse_context_parse        (GMarkupParseContext *context,
                                                         const gchar *text,
                                                         gssize text_len,
                                                         GError **error);
void                g_markup_parse_context_push         (GMarkupParseContext *context,
                                                         GMarkupParser *parser,
                                                         gpointer user_data);
gpointer            g_markup_parse_context_pop          (GMarkupParseContext *context);

enum                GMarkupCollectType;
gboolean            g_markup_collect_attributes         (const gchar *element_name,
                                                         const gchar **attribute_names,
                                                         const gchar **attribute_values,
                                                         GError **error,
                                                         GMarkupCollectType first_type,
                                                         const gchar *first_attr,
                                                         ...);

Description

The "GMarkup" parser is intended to parse a simple markup format that's a subset of XML. This is a small, efficient, easy-to-use parser. It should not be used if you expect to interoperate with other applications generating full-scale XML. However, it's very useful for application data files, config files, etc. where you know your application will be the only one writing the file. Full-scale XML parsers should be able to parse the subset used by GMarkup, so you can easily migrate to full-scale XML at a later time if the need arises.

GMarkup is not guaranteed to signal an error on all invalid XML; the parser may accept documents that an XML parser would not. However, XML documents which are not well-formed[5] are not considered valid GMarkup documents.

Simplifications to XML include:

  • Only UTF-8 encoding is allowed.

  • No user-defined entities.

  • Processing instructions, comments and the doctype declaration are "passed through" but are not interpreted in any way.

  • No DTD or validation.

The markup format does support:

  • Elements

  • Attributes

  • 5 standard entities: &amp; &lt; &gt; &quot; &apos;

  • Character references

  • Sections marked as CDATA

Details

enum GMarkupError

typedef enum
{
  G_MARKUP_ERROR_BAD_UTF8,
  G_MARKUP_ERROR_EMPTY,
  G_MARKUP_ERROR_PARSE,
  /* The following are primarily intended for specific GMarkupParser
   * implementations to set.
   */
  G_MARKUP_ERROR_UNKNOWN_ELEMENT,
  G_MARKUP_ERROR_UNKNOWN_ATTRIBUTE,
  G_MARKUP_ERROR_INVALID_CONTENT,
  G_MARKUP_ERROR_MISSING_ATTRIBUTE
} GMarkupError;

Error codes returned by markup parsing.

G_MARKUP_ERROR_BAD_UTF8

text being parsed was not valid UTF-8

G_MARKUP_ERROR_EMPTY

document contained nothing, or only whitespace

G_MARKUP_ERROR_PARSE

document was ill-formed

G_MARKUP_ERROR_UNKNOWN_ELEMENT

error should be set by GMarkupParser functions; element wasn't known

G_MARKUP_ERROR_UNKNOWN_ATTRIBUTE

error should be set by GMarkupParser functions; attribute wasn't known

G_MARKUP_ERROR_INVALID_CONTENT

error should be set by GMarkupParser functions; content was invalid

G_MARKUP_ERROR_MISSING_ATTRIBUTE

error should be set by GMarkupParser functions; a required attribute was missing

G_MARKUP_ERROR

#define G_MARKUP_ERROR g_markup_error_quark ()

Error domain for markup parsing. Errors in this domain will be from the GMarkupError enumeration. See GError for information on error domains.


enum GMarkupParseFlags

typedef enum
{
  G_MARKUP_DO_NOT_USE_THIS_UNSUPPORTED_FLAG = 1 << 0,
  G_MARKUP_TREAT_CDATA_AS_TEXT              = 1 << 1,
  G_MARKUP_PREFIX_ERROR_POSITION            = 1 << 2
} GMarkupParseFlags;

Flags that affect the behaviour of the parser.

G_MARKUP_DO_NOT_USE_THIS_UNSUPPORTED_FLAG

flag you should not use.

G_MARKUP_TREAT_CDATA_AS_TEXT

When this flag is set, CDATA marked sections are not passed literally to the passthrough function of the parser. Instead, the content of the section (without the <![CDATA[ and ]]>) is passed to the text function. This flag was added in GLib 2.12.

G_MARKUP_PREFIX_ERROR_POSITION

Normally errors caught by GMarkup itself have line/column information prefixed to them to let the caller know the location of the error. When this flag is set the location information is also prefixed to errors generated by the GMarkupParser implementation functions.

GMarkupParseContext

typedef struct _GMarkupParseContext GMarkupParseContext;

A parse context is used to parse a stream of bytes that you expect to contain marked-up text. See g_markup_parse_context_new(), GMarkupParser, and so on for more details.


GMarkupParser

typedef struct {
  /* Called for open tags <foo bar="baz"> */
  void (*start_element)  (GMarkupParseContext *context,
                          const gchar         *element_name,
                          const gchar        **attribute_names,
                          const gchar        **attribute_values,
                          gpointer             user_data,
                          GError             **error);

  /* Called for close tags </foo> */
  void (*end_element)    (GMarkupParseContext *context,
                          const gchar         *element_name,
                          gpointer             user_data,
                          GError             **error);

  /* Called for character data */
  /* text is not nul-terminated */
  void (*text)           (GMarkupParseContext *context,
                          const gchar         *text,
                          gsize                text_len,  
                          gpointer             user_data,
                          GError             **error);

  /* Called for strings that should be re-saved verbatim in this same
   * position, but are not otherwise interpretable.  At the moment
   * this includes comments and processing instructions.
   */
  /* text is not nul-terminated. */
  void (*passthrough)    (GMarkupParseContext *context,
                          const gchar         *passthrough_text,
                          gsize                text_len,  
                          gpointer             user_data,
                          GError             **error);

  /* Called on error, including one set by other
   * methods in the vtable. The GError should not be freed.
   */
  void (*error)          (GMarkupParseContext *context,
                          GError              *error,
                          gpointer             user_data);
} GMarkupParser;

Any of the fields in GMarkupParser can be NULL, in which case they will be ignored. Except for the error function, any of these callbacks can set an error; in particular the G_MARKUP_ERROR_UNKNOWN_ELEMENT, G_MARKUP_ERROR_UNKNOWN_ATTRIBUTE, and G_MARKUP_ERROR_INVALID_CONTENT errors are intended to be set from these callbacks. If you set an error from a callback, g_markup_parse_context_parse() will report that error back to its caller.

start_element ()

Callback to invoke when the opening tag of an element is seen.

end_element ()

Callback to invoke when the closing tag of an element is seen. Note that this is also called for empty tags like <empty/>.

text ()

Callback to invoke when some text is seen (text is always inside an element). Note that the text of an element may be spread over multiple calls of this function. If the G_MARKUP_TREAT_CDATA_AS_TEXT flag is set, this function is also called for the content of CDATA marked sections.

passthrough ()

Callback to invoke for comments, processing instructions and doctype declarations; if you're re-writing the parsed document, write the passthrough text back out in the same position. If the G_MARKUP_TREAT_CDATA_AS_TEXT flag is not set, this function is also called for CDATA marked sections.

error ()

Callback to invoke when an error occurs.

g_markup_escape_text ()

gchar*              g_markup_escape_text                (const gchar *text,
                                                         gssize length);

Escapes text so that the markup parser will parse it verbatim. Less than, greater than, ampersand, etc. are replaced with the corresponding entities. This function would typically be used when writing out a file to be parsed with the markup parser.

Note that this function doesn't protect whitespace and line endings from being processed according to the XML rules for normalization of line endings and attribute values.

Note also that if given a string containing them, this function will produce character references in the range of &#x1; ... &#x1f; for all control sequences except for tabstop, newline and carriage return. The character references in this range are not valid XML 1.0, but they are valid XML 1.1 and will be accepted by the GMarkup parser.

text :

some valid UTF-8 text

length :

length of text in bytes, or -1 if the text is nul-terminated

Returns :

a newly allocated string with the escaped text

g_markup_printf_escaped ()

gchar *             g_markup_printf_escaped             (const char *format,
                                                         ...);

Formats arguments according to format, escaping all string and character arguments in the fashion of g_markup_escape_text(). This is useful when you want to insert literal strings into XML-style markup output, without having to worry that the strings might themselves contain markup.

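const char *store = "Fortnum & Mason";
const char *item = "Tea";
char *output;

/* The '&' in store is escaped; the markup in the format string is not. */
output = g_markup_printf_escaped ("<purchase>"
                                  "<store>%s</store>"
                                  "<item>%s</item>"
                                  "</purchase>",
                                  store, item);

/* output is now:
 * <purchase><store>Fortnum &amp; Mason</store><item>Tea</item></purchase>
 */
g_free (output);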

format :

printf() style format string

... :

the arguments to insert in the format string

Returns :

newly allocated result from formatting operation. Free with g_free().

Since 2.4


g_markup_vprintf_escaped ()

gchar *             g_markup_vprintf_escaped            (const char *format,
                                                         va_list args);

Formats the data in args according to format, escaping all string and character arguments in the fashion of g_markup_escape_text(). See g_markup_printf_escaped().

format :

printf() style format string

args :

variable argument list, similar to vprintf()

Returns :

newly allocated result from formatting operation. Free with g_free().

Since 2.4


g_markup_parse_context_end_parse ()

gboolean            g_markup_parse_context_end_parse    (GMarkupParseContext *context,
                                                         GError **error);

Signals to the GMarkupParseContext that all data has been fed into the parse context with g_markup_parse_context_parse(). This function reports an error if the document isn't complete, for example if elements are still open.

context :

a GMarkupParseContext

error :

return location for a GError

Returns :

TRUE on success, FALSE if an error was set

g_markup_parse_context_free ()

void                g_markup_parse_context_free         (GMarkupParseContext *context);

Frees a GMarkupParseContext. Can't be called from inside one of the GMarkupParser functions. Can't be called while a subparser is pushed.

context :

a GMarkupParseContext

g_markup_parse_context_get_position ()

void                g_markup_parse_context_get_position (GMarkupParseContext *context,
                                                         gint *line_number,
                                                         gint *char_number);

Retrieves the current line number and the number of the character on that line. Intended for use in error messages; there are no strict semantics for what constitutes the "current" line number other than "the best number we could come up with for error messages."

context :

a GMarkupParseContext

line_number :

return location for a line number, or NULL

char_number :

return location for a char-on-line number, or NULL

g_markup_parse_context_get_element ()

const gchar *       g_markup_parse_context_get_element  (GMarkupParseContext *context);

Retrieves the name of the currently open element.

If called from the start_element or end_element handlers this will give the element_name as passed to those functions. For the parent elements, see g_markup_parse_context_get_element_stack().

context :

a GMarkupParseContext

Returns :

the name of the currently open element, or NULL

Since 2.2


g_markup_parse_context_get_element_stack ()

const GSList *      g_markup_parse_context_get_element_stack
                                                        (GMarkupParseContext *context);

Retrieves the element stack from the internal state of the parser. The returned GSList is a list of strings where the first item is the currently open tag (as would be returned by g_markup_parse_context_get_element()) and the next item is its immediate parent.

This function is intended to be used in the start_element and end_element handlers where g_markup_parse_context_get_element() would merely return the name of the element that is being processed.

context :

a GMarkupParseContext

Returns :

the element stack, which must not be modified

Since 2.16


g_markup_parse_context_get_user_data ()

gpointer            g_markup_parse_context_get_user_data
                                                        (GMarkupParseContext *context);

Returns the user_data associated with context. This will either be the user_data that was provided to g_markup_parse_context_new() or to the most recent call of g_markup_parse_context_push().

context :

a GMarkupParseContext

Returns :

the provided user_data. The returned data belongs to the markup context and will be freed when g_markup_parse_context_free() is called.

Since 2.18


g_markup_parse_context_new ()

GMarkupParseContext * g_markup_parse_context_new        (const GMarkupParser *parser,
                                                         GMarkupParseFlags flags,
                                                         gpointer user_data,
                                                         GDestroyNotify user_data_dnotify);

Creates a new parse context. A parse context is used to parse marked-up documents. You can feed any number of documents into a context, as long as no errors occur; once an error occurs, the parse context can't continue to parse text (you have to free it and create a new parse context).

parser :

a GMarkupParser

flags :

one or more GMarkupParseFlags

user_data :

user data to pass to GMarkupParser functions

user_data_dnotify :

user data destroy notifier called when the parse context is freed

Returns :

a new GMarkupParseContext

g_markup_parse_context_parse ()

gboolean            g_markup_parse_context_parse        (GMarkupParseContext *context,
                                                         const gchar *text,
                                                         gssize text_len,
                                                         GError **error);

Feed some data to the GMarkupParseContext. The data need not be valid UTF-8; an error will be signaled if it's invalid. The data need not be an entire document; you can feed a document into the parser incrementally, via multiple calls to this function. Typically, as you receive data from a network connection or file, you feed each received chunk of data into this function, aborting the process if an error occurs. Once an error is reported, no further data may be fed to the GMarkupParseContext; all errors are fatal.

context :

a GMarkupParseContext

text :

chunk of text to parse

text_len :

length of text in bytes

error :

return location for a GError

Returns :

FALSE if an error occurred, TRUE on success

g_markup_parse_context_push ()

void                g_markup_parse_context_push         (GMarkupParseContext *context,
                                                         GMarkupParser *parser,
                                                         gpointer user_data);

Temporarily redirects markup data to a sub-parser.

This function may only be called from the start_element handler of a GMarkupParser. It must be matched with a corresponding call to g_markup_parse_context_pop() in the matching end_element handler (except in the case that the parser aborts due to an error).

All tags, text and other data between the matching tags are redirected to the subparser given by parser. user_data is used as the user_data for that parser. user_data is also passed to the error callback in the event that an error occurs. This includes errors that occur in subparsers of the subparser.

The end tag matching the start tag for which this call was made is handled by the previous parser (which is given its own user_data). This is why g_markup_parse_context_pop() is provided: it allows "one last access" to the user_data provided to this function. In the case of error, the user_data provided here is passed directly to the error callback of the subparser and g_markup_parse_context_pop() should not be called. In either case, if user_data was allocated then it ought to be freed from both of these locations.

This function is not intended to be directly called by users interested in invoking subparsers. Instead, it is intended to be used by the subparsers themselves to implement a higher-level interface.

As an example, see the following implementation of a simple parser that counts the number of tags encountered.

typedef struct
{
  gint tag_count;
} CounterData;

static void
counter_start_element (GMarkupParseContext  *context,
                       const gchar          *element_name,
                       const gchar         **attribute_names,
                       const gchar         **attribute_values,
                       gpointer              user_data,
                       GError              **error)
{
  CounterData *data = user_data;

  data->tag_count++;
}

static void
counter_error (GMarkupParseContext *context,
               GError              *error,
               gpointer             user_data)
{
  CounterData *data = user_data;

  g_slice_free (CounterData, data);
}

static GMarkupParser counter_subparser =
{
  counter_start_element,
  NULL,
  NULL,
  NULL,
  counter_error
};

In order to allow this parser to be easily used as a subparser, the following interface is provided:

void
start_counting (GMarkupParseContext *context)
{
  CounterData *data = g_slice_new (CounterData);

  data->tag_count = 0;
  g_markup_parse_context_push (context, &counter_subparser, data);
}

gint
end_counting (GMarkupParseContext *context)
{
  CounterData *data = g_markup_parse_context_pop (context);
  int result;

  result = data->tag_count;
  g_slice_free (CounterData, data);

  return result;
}

The subparser would then be used as follows:

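/* Needs <string.h> for strcmp(). */
static void
start_element (GMarkupParseContext  *context,
               const gchar          *element_name,
               const gchar         **attribute_names,
               const gchar         **attribute_values,
               gpointer              user_data,
               GError              **error)
{
  if (strcmp (element_name, "count-these") == 0)
    start_counting (context);

  /* else, handle other tags... */
}

static void
end_element (GMarkupParseContext  *context,
             const gchar          *element_name,
             gpointer              user_data,
             GError              **error)
{
  if (strcmp (element_name, "count-these") == 0)
    g_print ("Counted %d tags\n", end_counting (context));

  /* else, handle other tags... */
}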

context :

a GMarkupParseContext

parser :

a GMarkupParser

user_data :

user data to pass to GMarkupParser functions

Since 2.18


g_markup_parse_context_pop ()

gpointer            g_markup_parse_context_pop          (GMarkupParseContext *context);

Completes the process of a temporary sub-parser redirection.

This function exists to collect the user_data allocated by a matching call to g_markup_parse_context_push(). It must be called in the end_element handler corresponding to the start_element handler during which g_markup_parse_context_push() was called. You must not call this function from the error callback -- the user_data is provided directly to the callback in that case.

This function is not intended to be directly called by users interested in invoking subparsers. Instead, it is intended to be used by the subparsers themselves to implement a higher-level interface.

context :

a GMarkupParseContext

Returns :

the user_data passed to g_markup_parse_context_push().

Since 2.18


enum GMarkupCollectType

typedef enum
{
  G_MARKUP_COLLECT_INVALID,
  G_MARKUP_COLLECT_STRING,
  G_MARKUP_COLLECT_STRDUP,
  G_MARKUP_COLLECT_BOOLEAN,
  G_MARKUP_COLLECT_TRISTATE,

  G_MARKUP_COLLECT_OPTIONAL = (1 << 16)
} GMarkupCollectType;

A mixed enumerated type and flags field. You must specify one type (string, strdup, boolean, tristate). Additionally, you may optionally bitwise OR the type with the flag G_MARKUP_COLLECT_OPTIONAL.

It is likely that this enum will be extended in the future to support other types.

G_MARKUP_COLLECT_INVALID

used to terminate the list of attributes to collect.

G_MARKUP_COLLECT_STRING

collect the string pointer directly from the attribute_values[] array. Expects a parameter of type (const char **). If G_MARKUP_COLLECT_OPTIONAL is specified and the attribute isn't present then the pointer will be set to NULL.

G_MARKUP_COLLECT_STRDUP

as with G_MARKUP_COLLECT_STRING, but expects a parameter of type (char **) and g_strdup()s the returned pointer. The pointer must be freed with g_free().

G_MARKUP_COLLECT_BOOLEAN

expects a parameter of type (gboolean *) and parses the attribute value as a boolean. Sets FALSE if the attribute isn't present. Valid boolean values consist of (case insensitive) "false", "f", "no", "n", "0" and "true", "t", "yes", "y", "1".

G_MARKUP_COLLECT_TRISTATE

as with G_MARKUP_COLLECT_BOOLEAN, but in the case of a missing attribute a value is set that compares equal to neither FALSE nor TRUE. G_MARKUP_COLLECT_OPTIONAL is implied.

G_MARKUP_COLLECT_OPTIONAL

can be bitwise ORed with the other fields. If present, allows the attribute not to appear. A default value is set depending on what value type is used.

g_markup_collect_attributes ()

gboolean            g_markup_collect_attributes         (const gchar *element_name,
                                                         const gchar **attribute_names,
                                                         const gchar **attribute_values,
                                                         GError **error,
                                                         GMarkupCollectType first_type,
                                                         const gchar *first_attr,
                                                         ...);

Collects the attributes of the element from the data passed to the GMarkupParser start_element function, dealing with common error conditions and supporting boolean values.

This utility function is not required to write a parser but can save a lot of typing.

The element_name, attribute_names, attribute_values and error parameters passed to the start_element callback should be passed unmodified to this function.

Following these arguments is a list of "supported" attributes to collect. It is an error to specify multiple attributes with the same name. If any attribute not in the list appears in the attribute_names array then an unknown attribute error will result.

The GMarkupCollectType field allows specifying the type of collection to perform and if a given attribute must appear or is optional.

The attribute name is simply the name of the attribute to collect.

The pointer should be of the appropriate type (see the descriptions under GMarkupCollectType) and may be NULL in case a particular attribute is to be allowed but ignored.

This function deals with issuing errors for missing attributes (of type G_MARKUP_ERROR_MISSING_ATTRIBUTE), unknown attributes (of type G_MARKUP_ERROR_UNKNOWN_ATTRIBUTE) and duplicate attributes (of type G_MARKUP_ERROR_INVALID_CONTENT) as well as parse errors for boolean-valued attributes (again of type G_MARKUP_ERROR_INVALID_CONTENT). In all of these cases FALSE will be returned and error will be set as appropriate.

element_name :

the current tag name

attribute_names :

the attribute names

attribute_values :

the attribute values

error :

a pointer to a GError or NULL

first_type :

the GMarkupCollectType of the first attribute

first_attr :

the name of the first attribute

... :

a pointer to the storage location of the first attribute (or NULL), followed by more types, names and pointers, ending with G_MARKUP_COLLECT_INVALID.

Returns :

TRUE if successful

Since 2.16



[5] XML specification