Data Capture

Overview

The Data Capture component is responsible for importing (“capturing”) new data into the Phaedra system. A capture typically starts from an external trigger, such as another program sending out a notification that a new measurement dataset is available for import. These notifications can take the form of an API call or a message in a messaging system such as Amazon SQS, Amazon SNS or Apache Kafka.
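For illustration, the sketch below shows how such a notification might be consumed from Kafka using Spring for Apache Kafka. The topic name (datacapture-requests), the CaptureRequest payload and the DataCaptureService interface are assumptions made for this example, not part of Phaedra's actual API.

```java
// Illustrative sketch of a capture-request listener, using Spring for Apache Kafka.
// Topic name, payload shape and service interface are assumptions for this example;
// JSON deserialization of the payload is assumed to be configured elsewhere.
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class CaptureRequestListener {

    // Hypothetical interface to the rest of the Data Capture Service.
    interface DataCaptureService {
        void startCapture(String measurementId, String sourcePath);
    }

    // Hypothetical notification payload: which measurement, and where its files live.
    public record CaptureRequest(String measurementId, String sourcePath) {}

    private final DataCaptureService dataCaptureService;

    public CaptureRequestListener(DataCaptureService dataCaptureService) {
        this.dataCaptureService = dataCaptureService;
    }

    // Another system announces a new measurement dataset on this (assumed) topic.
    @KafkaListener(topics = "datacapture-requests", groupId = "data-capture-service")
    public void onCaptureRequest(CaptureRequest request) {
        // Kick off the capture; the service loads the matching capture configuration.
        dataCaptureService.startCapture(request.measurementId(), request.sourcePath());
    }
}
```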

Data Capture Flow

Capture Jobs

Once a data capture request is received, the Data Capture Service will load the appropriate capture configuration. This configuration defines a set of capture scripts that must be executed to fully capture the dataset, covering:

  • well-level data
  • cell-level data
  • raw well images
  • well image masks/overlays

Usually, each type of data is captured by a dedicated capture script, so a full capture configuration might include four or more scripts.
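To make this concrete, the sketch below models such a configuration in Java. The CaptureConfiguration and CaptureScript types, field names and script identifiers are invented for this example and do not reflect Phaedra's actual schema.

```java
import java.util.List;

// Illustrative model of a capture configuration: an ordered list of capture
// scripts, one per type of data. All names here are assumptions for this sketch.
public record CaptureConfiguration(String name, List<CaptureScript> scripts) {

    // One script per type of data to capture, executed in list order.
    public record CaptureScript(String id, String dataType) {}

    // Example: a full configuration covering the four data types listed above.
    public static CaptureConfiguration example() {
        return new CaptureConfiguration("example-capture-config", List.of(
                new CaptureScript("capture-well-data", "well-level data"),
                new CaptureScript("capture-cell-data", "cell-level data"),
                new CaptureScript("capture-well-images", "raw well images"),
                new CaptureScript("capture-image-overlays", "well image masks/overlays")));
    }
}
```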

The scripts can vary from small and fast (e.g. parsing a single CSV file) to large and resource-intensive (e.g. converting and compressing thousands of TIFF image files), and are therefore not executed by the Data Capture Service itself. Instead, they are submitted to a queue and processed by a pool of dedicated ScriptEngine worker nodes.
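As a sketch of this hand-off, assuming Kafka is used as the queue: each script execution becomes one message, from which an idle ScriptEngine worker can pick it up. The scriptengine-requests topic and the ScriptExecutionRequest payload are assumptions for this example.

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Sketch of how the Data Capture Service could off-load a capture script to the
// ScriptEngine workers via a queue. Topic name and payload shape are assumptions.
@Component
public class ScriptExecutionSubmitter {

    // Hypothetical message payload: which script to run, for which measurement.
    public record ScriptExecutionRequest(String measurementId, String scriptId) {}

    private final KafkaTemplate<String, ScriptExecutionRequest> kafkaTemplate;

    public ScriptExecutionSubmitter(KafkaTemplate<String, ScriptExecutionRequest> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void submit(String measurementId, String scriptId) {
        // One message per script execution; a free ScriptEngine worker consumes it,
        // runs the script, and reports its result back.
        kafkaTemplate.send("scriptengine-requests", measurementId,
                new ScriptExecutionRequest(measurementId, scriptId));
    }
}
```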

Once the worker nodes have finished executing all the capture scripts, the Data Capture Service marks the measurement as captured, and it becomes available in Phaedra for further processing. At that point, an event is emitted via Kafka so that other components (such as the Pipeline Service) can proceed with downstream processing and calculation.
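A minimal sketch of this final step is shown below. The datacapture-events topic and the MeasurementCapturedEvent payload are assumptions for this example, and persisting the captured status is left out.

```java
import org.springframework.kafka.core.KafkaTemplate;

// Sketch of the completion step: once every capture script has finished, emit an
// event that lets downstream components (such as the Pipeline Service) proceed.
// Topic and event names are illustrative assumptions.
public class CaptureCompletionHandler {

    // Hypothetical event payload.
    public record MeasurementCapturedEvent(String measurementId, String eventType) {}

    private final KafkaTemplate<String, MeasurementCapturedEvent> kafkaTemplate;

    public CaptureCompletionHandler(KafkaTemplate<String, MeasurementCapturedEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void onScriptsFinished(String measurementId) {
        // Marking the measurement as captured in the database is omitted here;
        // this sketch only shows the event that unlocks downstream processing.
        kafkaTemplate.send("datacapture-events", measurementId,
                new MeasurementCapturedEvent(measurementId, "measurementCaptured"));
    }
}
```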