Updating DSC data

Handling new data in DSCng

This document describes the various ways in which you can update your DSCng database with new DSC data.

New data in the DSC system are submitted to the presenter via uploaded XML files. These are sent once a minute by the collector(s) and processed at the presenter using a cron script. By using these files, you can get new data into DSCng as soon as they arrive.

An alternative way to update is to use the .dat files created by DSC, which are also used for the initial import. This method and its advantages and disadvantages are discussed in the corresponding chapter below.

Using DSC transport XML files

The preferred method of getting newly arriving data into DSCng's database is to use the XML files that DSC uses for communication between the collector and the presenter.

At present, there are two ways to accomplish this task. The preferred one is more complicated to set up, but faster and safer to run. The alternative does not require much setting up, but it has some disadvantages and may become unsupported in future versions of DSCng, so it is best avoided.

Both of the presented ways require that you tell the DSCng system about the format of the data you will be using. You do this by calling:

python register_data_source.py data/dsc_xml.data_source.json

in the dscng source directory. You have to do this only once after a clean DSCng install.

Update daemon

The best way to handle arriving XML files is to run a dedicated update daemon, to which you submit the data from a cron script (similarly to how the original DSC does it) and which takes care of the rest.

The daemon queues the incoming data and pushes it into the database in the background. It uses an RPC protocol for data upload and is thus prepared to one day replace the current method of uploading XML files with a more elegant and efficient one. It also uses several worker processes to speed up the processing of data on multicore/multiprocessor machines.

Because at present we deal with XML data that has already been uploaded to the presenter, this documentation describes how to set the daemon up for local access only.

Configuration

The update daemon uses a standard format config file for its settings. An example version of this file is located in dscng/update_daemon.conf.example. You can copy this file into a new configuration file, such as update_daemon.conf, and edit it to match your setup. The following list describes the individual options:

daemon.hostname:
hostname (or an address) of the interface on which the daemon is to listen for new RPC connections. Use 'localhost' for a local-only setup.
daemon.port:
port on which the daemon should listen.
daemon.user:
name of the user under which the daemon should run. The daemon is usually started by root and drops its privileges using setuid after it has started successfully. Please see Dedicated user below for more information.
daemon.pidfile:
name of the file into which the daemon writes its PID. Only for use by experts.
daemon.journal_dir:
name of a directory in which the daemon journals arriving data for crash protection. This directory has to exist and be writable by the daemon user.
daemon.debug:
whether the daemon should print more verbose information to the logs.
authentication.client_authentication:
True or False. Defines whether the client must authenticate itself before being allowed to connect to the daemon.
authentication.auth_file:
name of a file containing the password for password-based authentication. Implies authentication.client_authentication = True.
logging.log_file:
name of the file that will be used for logging the daemon's activity.

In most cases it should not be necessary to change any of the default values used in the example file. Just make sure that the directory defined in journal_dir exists and is writable by the configured user, and that the directory into which the pidfile is to be written exists.
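
For reference, a complete configuration using the options above could look like the following. The port number and paths here are only illustrative examples, not defaults shipped with DSCng, so adjust them to your setup:

[daemon]
hostname = localhost
port = 9999
user = dscng
pidfile = /var/run/dscng/update_daemon.pid
journal_dir = /var/lib/dscng/journal
debug = False

[authentication]
client_authentication = False

[logging]
log_file = /var/log/dscng/update_daemon.log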

Password protection

It is possible to set up password authentication for the client connecting to the daemon. This is useful even for local-only access, because it keeps local users from getting low-level access to the daemon by using the RPC directly.

The password is never exchanged between the client and the server; instead, a challenge-response mechanism is used in which the client uses the password to compute a response to a server-generated challenge. This method is therefore safe from eavesdropping, but it requires the password to be stored in plain text on both the server and the client. You should thus make sure that the file containing the password is readable only by the chosen users.

To create an authentication file for password authentication, simply put the password into a file as its only content. The best method is to generate a random password, for example by running the following:

head -c 24 /dev/urandom | base64

You can then copy this file to the client without needing to remember the password.
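
For example, to generate the password, write it straight into the authentication file and restrict access to it in one go (the file name matches the configuration snippet below; pick whatever name suits you):

head -c 24 /dev/urandom | base64 > update_daemon.passwd
chmod 600 update_daemon.passwd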

You then need to add the following settings to the daemon config file:

[authentication]
client_authentication = True
auth_file = update_daemon.passwd

Where update_daemon.passwd is the name of the file where the password is stored. You also need to pass a copy of this password file to the update_from_xml.py script. This will be discussed later.

Dedicated user

For maximum security, it is recommended to create a dedicated user without an active shell and without special privileges to run the update daemon. The daemon is capable of running as an unprivileged user by switching to a different user after it has started successfully. To specify a non-privileged user account for the daemon, just add the following to your config file:

[daemon]
user = dscng

Or extend the [daemon] section with the user setting.

However, before you do this, you need to create the user itself, for example by running the following:

sudo adduser --system --group --no-create-home --home /nonexistent dscng

Where 'dscng' is the name of the new user. The --group switch makes sure a corresponding group is created as well, which can be useful, for example, for the users that should be able to connect to the daemon and therefore be allowed to read the authentication file.
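
For instance, to let an existing account connect to a password-protected daemon, you could add it to the new group and make the authentication file readable by that group ('alice' is just a hypothetical user name here):

sudo usermod -a -G dscng alice
sudo chgrp dscng update_daemon.passwd
sudo chmod 640 update_daemon.passwd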

Once the user is created and set up in the config file, it is recommended to also use it to limit access to the daemon journal directory by running the following (as root):

chown dscng /path/to/journal/
chmod 755 /path/to/journal/

Running the daemon

When the update daemon is configured, you can start it by running (as root):

python update_daemon.py -c update_daemon.conf start

Where update_daemon.conf is the path to the configuration file. You can also override any of the config file settings on the command line. To see a list of the corresponding switches, run the daemon like this:

sudo python update_daemon.py -h

To stop the daemon, you use the same program, just with 'stop' as the command instead of 'start':

python update_daemon.py -c update_daemon.conf -d stop

This will trigger a graceful shutdown of the daemon, which means that it will process all queued data, while refusing to accept new submissions, and then exit. You can accomplish the same result by sending the INT signal to the daemon process:

kill -SIGINT pid

where pid is the PID of the daemon process (available in the daemon pidfile).
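
For example, assuming the pidfile path from the configuration sketch above:

kill -SIGINT "$(cat /var/run/dscng/update_daemon.pid)"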

If a faster shutdown is desired, the TERM signal may be used:

kill pid

or:

kill -SIGTERM pid

This will trigger an immediate (yet still graceful) shutdown: all data in the queue will be ignored and the daemon will stop without processing it (the data will remain in the journal and will be processed once the daemon is started again).

Submitting data to the daemon

The XML files are pushed to the daemon by the 'update_from_xml.py' script:

python update_from_xml.py path/to/dsc/xml/files/*.xml

It is possible to pass a hostname, port and authentication file to the script using command line arguments. To find out more, run the script as follows:

python update_from_xml.py -h

This script will parse the XML files, push the data into the daemon's queue and exit. The daemon will then start to process the data in the queue, so the fact that 'update_from_xml.py' has finished does not mean that the data is already in the database.

If you have set up the daemon to require password authentication (see Password protection), you need to supply the password file (or rather its copy) to the update script as well. To do so, add the -f (--auth-file) argument:

python update_from_xml.py --auth-file pass_file path/to/dsc/xml/files/*.xml

where 'pass_file' is the name of the file containing the password.
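
To keep the database up to date, you would typically call the update script from cron on the presenter, once a minute as the XML files arrive. A minimal crontab entry could look like the following; the paths are placeholders taken from the examples above and have to be adjusted to your installation (drop the --auth-file argument if you have not enabled password authentication):

* * * * * cd /path/to/dscng && python update_from_xml.py --auth-file update_daemon.passwd /path/to/dsc/xml/files/*.xml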

Direct database access

This method of updates uses a feature of the above-mentioned 'update_from_xml.py' script which allows the data to be inserted directly into the database without going through the daemon first.

Unlike the daemon method, this method is not asynchronous, so the script runs for as long as the update takes. It also cannot make use of some of the pre-cached data that the daemon uses, and it does not queue the data. It is therefore not safe against several update processes running at once, and it is up to you to ensure that only one instance of 'update_from_xml.py' runs at a time (see the sketch below).

This method requires the user running the script to have access to the database, which is not necessary when using the daemon.

Because of these limitations, it is recommended to use the daemon. On the other hand, the direct method is simpler to implement, as it requires just the following:

python update_from_xml.py --direct-db-access path/to/dsc/xml/files/*.xml
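
If you call this from cron, one way to guard against overlapping runs is to wrap the command in flock(1); the lock file path here is just an example:

flock -n /var/lock/dscng-update.lock python update_from_xml.py --direct-db-access path/to/dsc/xml/files/*.xml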

Using DSC .dat files

DSC stores its production data in .dat files. These files always contain data for a whole day and are periodically appended to during that day. Because the DSCng import script can detect double imports and deal with them, it is able to update data from the .dat files as well.

However, this approach has some disadvantages. While the .dat file import is generally faster than the update from XML files, it is optimized for whole-day data. When the same file with updated content is imported a second time, DSCng detects the change and deals with it by removing all the data it already had for the corresponding day and reimporting the file from scratch. This means that if you use the .dat files for updates every minute, you will be constantly deleting and reinserting data in your database. This does not necessarily mean that it would be much slower than the XML updates, but it is certainly not optimal.

A possible way to deal with this problem is to update the data less frequently, down to only once a day, but by doing so you sacrifice how up to date your data is.

If you are for some reason still willing to implement this method, you can use the same general procedure described in the installation guide for the initial import of data. However, for efficient use for updates, some adjustments are necessary:

  1. Make sure that you run the import script with as specific a path to the imported data as possible. For the initial import, it is fine to run the import script with the path to all the DSC .dat files you have; the script will wade through the potentially thousands of files and find the ones it has not seen yet. However, this takes time, and with many files it is very likely to take longer than the minute you have before the next data arrives and a new import is needed. Running the script only over the affected directory, such as /usr/local/dsc/data/dns-s/ns0/20120920/ instead of /usr/local/dsc/data/, will therefore be much faster. Hint: depending on your shell and its settings, it might be possible to specify all server directories for one day using the following pattern - /usr/local/dsc/data/*/**/20120920/.
  2. Make sure that you do not run the import script several times in parallel, for example by starting it in response to new data without checking that all previous data has been processed and the old import process has ended. The import script is optimized for continuous batch import of data and is not compatible with several independently running copies; running them would crash the import or at least garble your data (see the sketch after this list).
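
As a rough illustration of both points, a daily cron entry could combine a specific per-day path with a flock(1) guard against overlapping runs. Note that 'import_dsc_data.py' is only a hypothetical stand-in for the import script described in the installation guide, and the lock file, schedule and path pattern are likewise just examples:

# import_dsc_data.py is a hypothetical placeholder for the DSCng import script
15 0 * * * flock -n /var/lock/dscng-dat-import.lock python import_dsc_data.py /usr/local/dsc/data/*/*/$(date -d yesterday +\%Y\%m\%d)/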