Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data - PowerPoint PPT Presentation

Slides: 20
Provided by: wadesh

Transcript and Presenter's Notes


1
Dynamic, Rule-based Quality Control Framework for
Real-time Sensor Data
  • Wade Sheldon
  • Georgia Coastal Ecosystems LTER
  • University of Georgia

2
Introduction
  • Quality Control of high volume, real-time data
    from automated sensors is an emerging challenge
  • Traditional techniques (plotting, stats) often
    don't scale well
  • Data validation and Q/C can be the limiting factor
    in getting data online
  • Difficulties lead to release delays or posting
    provisional data
  • Software developed at Georgia Coastal Ecosystems
    LTER has proven useful for Q/C of real-time data
  • Designed to automate GCE data processing and
    metadata generation, but very generalized and
    supports any tabular data
  • Provides dynamic, rule-based Q/C framework for
    data processing, analysis and synthesis

3
Framework Components
  • Comprehensive data model
  • Implemented as hierarchical MATLAB structure
    arrays
  • Package dataset attribute metadata, data, Q/C
    rules, qualifier flags
  • Metadata-based MATLAB software (GCE Data Toolbox)
  • Automatic (rule-based) and manual assignment of
    Q/C qualifier flags
  • Transparent management of flags throughout all
    data manipulation
  • Q/C-aware data management and analysis tools
  • Q/C-aware data integration and synthesis tools
  • Modular implementation supports many scenarios
  • Interactive (command-line API and GUI forms)
  • Automated workflows (timed or triggered)
  • End-to-end (logger-to-scientist) or part of
    larger workflow
  • Runs natively on multiple platforms (PC, *nix,
    MacOS)

4
GCE Data Toolbox Data Model

5
Quality Control Rules
  • Basic syntax: logical expression='flag code'
  • Logical Expressions
  • Any conditional statement or call to MATLAB
    function that returns a logical array (0 = false,
    1 = true)
  • Dataset columns referenced in statements as:
  • x = alias for current column (e.g. x<0)
  • col_name = any dataset column by name (e.g.
    col_Depth<0)
  • Flag Codes
  • Alphanumeric character to assign when expression
    is true (I, q, 9, ...)
  • Codes defined in the dataset metadata (I =
    invalid value, ...)
  • Unlimited rules per attribute, multiple flags per
    value
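
The rule syntax above can be illustrated with a small parser. This is a minimal Python sketch, not the toolbox's MATLAB implementation. The name `parse_rules`, the use of `;` to separate rules, and the restriction to single-character quoted flag codes are assumptions made for illustration.

```python
import re

# One rule has the form  <logical expression>='<flag code>'.
# Assumptions: rules are joined with ';' and a flag code is one
# alphanumeric character wrapped in single quotes.
RULE_PATTERN = re.compile(r"^(?P<expr>.+)='(?P<flag>\w)'$")


def parse_rules(rule_string):
    """Split a rule string into (expression, flag_code) pairs."""
    rules = []
    for part in rule_string.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing ';'
        match = RULE_PATTERN.match(part)
        if match is None:
            raise ValueError(f"not a valid rule: {part!r}")
        rules.append((match.group("expr"), match.group("flag")))
    return rules


if __name__ == "__main__":
    for expr, flag in parse_rules("x<0='I';x>100='I';x<20='Q';x>80='Q'"):
        print(f"{flag}: {expr}")
```

The sketch only separates each expression from its flag code. Evaluating the expression is left to the host environment, which in the toolbox's case is MATLAB.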

9
Quality Control Rule Examples
  • Numeric Comparisons
  • Simple
  • x<0='I' (flags negative values)
  • x<0='I';x>100='I';x<20='Q';x>80='Q' (overlapping
    bounds checks)
  • Statistical
  • x>(mean(x)+3*std(x))='Q';x<(mean(x)-3*std(x))='Q'
    (flags values more than 3 standard deviations
    from column mean)
  • Multi-column
  • col_DOC>col_TOC='I' (in column DOC: flags DOC
    exceeding TOC)
  • col_Dry_Weight<(col_Wet_Weight-col_Ash_Weight)*0.9='I'
    (flags dry weights below 90% of wet weight - ash
    weight)
  • col_Depth<0='I' (in column Salinity: flags
    Salinity when Depth < 0)
  • Compound (Boolean operators)
  • col_RH_Percent>100&col_Precip<0.1='Q' (flags
    humidity > 100% except during significant
    precipitation events)
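
The rule types listed above (bounds, statistical, multi-column) can be sketched in Python to show how several rules combine into one flag string per value. This is an illustration under stated assumptions, not the toolbox's code: `apply_rules`, `outside_n_sigma`, `BOUNDS_RULES` and the column names `Value` and `Depth` are hypothetical, and each rule is written as a Python callable rather than a MATLAB expression.

```python
from statistics import mean, stdev


def apply_rules(columns, target, rules):
    """Return one string of flag codes per value in columns[target].

    Each rule is a (test, flag) pair. test(x, cols, i) returns True when
    row i should receive the flag; x is the target column (the 'x' alias)
    and cols gives access to every column by name.
    """
    x = columns[target]
    flags = []
    for i in range(len(x)):
        codes = ""
        for test, flag in rules:
            # A value may collect several different flags, each one once.
            if flag not in codes and test(x, columns, i):
                codes += flag
        flags.append(codes)
    return flags


def outside_n_sigma(n):
    """Build a test that is True more than n standard deviations from the mean."""
    def test(x, cols, i):
        centre, spread = mean(x), stdev(x)
        return x[i] > centre + n * spread or x[i] < centre - n * spread
    return test


# Python counterparts of the overlapping bounds checks and a multi-column rule.
BOUNDS_RULES = [
    (lambda x, cols, i: x[i] < 0, "I"),
    (lambda x, cols, i: x[i] > 100, "I"),
    (lambda x, cols, i: x[i] < 20, "Q"),
    (lambda x, cols, i: x[i] > 80, "Q"),
    (lambda x, cols, i: cols["Depth"][i] < 0, "I"),
]

if __name__ == "__main__":
    data = {
        "Value": [-1.0, 10.0, 35.0, 90.0, 120.0],
        "Depth": [1.0, 1.0, -0.5, 1.0, 1.0],
    }
    print(apply_rules(data, "Value", BOUNDS_RULES))
```

With the overlapping bounds, a value of 120 picks up both `I` and `Q`, which mirrors the "multiple flags per value" behaviour described in the slides.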

12
Quality Control Rule Examples (cont.)
  • Text Comparisons
  • IS, NOT for strings; IN, NOT IN for lists
  • flag_notinlist(x,'Spartina,Juncus,Zizaniopsis')='Q'
  • Algorithmic Criteria (custom functions)
  • fn(parameters)='Q'
  • Various included Q/C functions
  • pattern checks, geographic checks, specialized
    algorithms (O2 saturation, etc.)
  • User-defined functions
  • Any MATLAB code or wrapped calls to FORTRAN,
    Java, Python, etc.
  • Unlimited scope
  • Full suite of MATLAB numeric analysis
    capabilities supported, and extensible to use
    other technology
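
Function-based criteria such as the list check and the pattern checks mentioned above all share one shape: a function takes column values and returns one true/false result per value. The Python sketch below shows that shape. Both functions are hypothetical stand-ins written for this example (the real `flag_notinlist` is a MATLAB function whose exact signature is not shown in the slides), and the repeated-value check is only one possible kind of pattern check.

```python
def not_in_list(values, allowed, case_sensitive=False):
    """True for each text value that is not one of the allowed strings."""
    if case_sensitive:
        allowed_set = set(allowed)
        return [v not in allowed_set for v in values]
    allowed_set = {a.lower() for a in allowed}
    return [v.lower() not in allowed_set for v in values]


def repeated_run(values, min_length):
    """True for each value inside a run of at least min_length equal values."""
    result = [False] * len(values)
    start = 0
    for end in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes.
        if end == len(values) or values[end] != values[start]:
            if end - start >= min_length:
                for i in range(start, end):
                    result[i] = True
            start = end
    return result


def to_flags(mask, code):
    """Turn a list of booleans into flag codes ('' where the test is False)."""
    return [code if hit else "" for hit in mask]


if __name__ == "__main__":
    species = ["Spartina", "juncus", "Typha"]
    allowed = ["Spartina", "Juncus", "Zizaniopsis"]
    print(to_flags(not_in_list(species, allowed), "Q"))
    print(to_flags(repeated_run([1, 2, 2, 2, 3], 3), "Q"))
```

Because every check returns one boolean per value, a user-defined check can be dropped into a rule without changing how flags are assigned.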

13
Q/C Rule Management
  • Rules can be defined in metadata templates and
    are automatically applied to attributes when raw
    data are imported
  • Rules can also be created and managed using a GUI
    form

14
Q/C Flag Assignment
  • Q/C criteria evaluated to assign/clear flags
    when:
  • Metadata template applied or Q/C criteria edited
  • New data records, columns added
  • Values edited (GUI) or columns updated (CLI)
  • Evaluation function (dataflag) invoked directly
  • Flags can also be assigned/cleared manually by:
  • Clicking/dragging on plots with the mouse
  • Using a spreadsheet-like grid
  • Importing from text attributes (e.g. 3rd party
    codes)
  • Propagating flags from source column(s) to
    dependent column(s)
  • Manual assignment locks flags by inserting a
    'manual' token in the criteria; removing 'manual'
    restores automatic evaluation
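
The interplay between automatic evaluation and manual locking described above can be sketched as a small state machine. This Python example is an assumption-laden illustration: the `Column` class, its method names, and the boolean `locked` attribute (standing in for the 'manual' token in the criteria) are hypothetical and do not reflect how the toolbox stores its criteria.

```python
class Column:
    """A data column whose flags are re-evaluated unless manually locked."""

    def __init__(self, values, rules):
        self.values = list(values)
        self.rules = list(rules)          # (test, flag) pairs; test takes one value
        self.flags = [""] * len(self.values)
        self.locked = False               # stands in for the 'manual' token
        self.evaluate()

    def evaluate(self):
        """Recompute all flags from the rules, unless flags are locked."""
        if self.locked:
            return
        self.flags = [
            "".join(flag for test, flag in self.rules if test(value))
            for value in self.values
        ]

    def set_value(self, index, value):
        """Editing a value triggers automatic re-evaluation."""
        self.values[index] = value
        self.evaluate()

    def set_flag(self, index, flag):
        """A manual assignment overrides the rules and locks the flags."""
        self.flags[index] = flag
        self.locked = True

    def unlock(self):
        """Removing the lock restores automatic evaluation."""
        self.locked = False
        self.evaluate()


if __name__ == "__main__":
    column = Column([-1, 50], [(lambda v: v < 0, "I"), (lambda v: v > 100, "I")])
    column.set_flag(1, "Q")     # manual flag, locks the column
    column.set_value(0, 5)      # no re-evaluation while locked
    print(column.flags)
    column.unlock()             # rules take over again
    print(column.flags)
```

The design point is that manual edits are never silently overwritten: automatic evaluation resumes only after an explicit unlock.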

15
Q/C-Aware Data Management & Analysis
  • Q/C flags can be visualized in data editor grid
    and plots
  • Flagged values can be selectively removed from
    data sets
  • Statistics can be generated with/without flagged
    values
  • Flags can be instantiated as coded text columns
    for export
  • Flagged, missing values can be summarized by
    parameter and date for metadata
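
Two of the operations above, statistics with or without flagged values and a summary of flagged and missing values, can be sketched as follows. This is a minimal Python illustration; the function names and the use of `None` for missing values are assumptions for the example, not toolbox conventions.

```python
from collections import Counter
from statistics import mean


def column_mean(values, flags, exclude_flagged=True):
    """Mean of the non-missing values, optionally skipping flagged ones."""
    kept = [
        v for v, f in zip(values, flags)
        if v is not None and not (exclude_flagged and f)
    ]
    return mean(kept) if kept else None


def flag_summary(values, flags):
    """Count missing values and occurrences of each flag code."""
    counts = Counter()
    for v, f in zip(values, flags):
        if v is None:
            counts["missing"] += 1
        for code in f:
            counts[code] += 1
    return dict(counts)


if __name__ == "__main__":
    values = [10.0, 20.0, None, 300.0]
    flags = ["", "", "", "I"]
    print(column_mean(values, flags))                         # flagged value skipped
    print(column_mean(values, flags, exclude_flagged=False))  # flagged value kept
    print(flag_summary(values, flags))
```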

16
Q/C-Aware Data Synthesis
  • Flagged, missing values summarized in re-sampled
    data (aggregated, binned, date-time resampled),
    with automatic Q/C rule creation
  • Flags automatically locked when merging
    multiple data sets (i.e. unions)
  • All Q/C operations logged to processing history,
    reported in metadata to document lineage
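
Carrying Q/C information through re-sampling, as described above, can be sketched with a daily aggregation that reports how many values in each bin were missing or flagged. This is an illustration in Python under stated assumptions: `daily_summary`, the record layout, and the `max_bad_fraction` threshold that flags an aggregated value are all hypothetical and only imitate the idea of creating a Q/C rule on the derived data.

```python
from collections import defaultdict
from statistics import mean


def daily_summary(records, max_bad_fraction=0.5):
    """Aggregate (day, value_or_None, flags) records by day.

    Flagged and missing values are left out of the mean but counted, and
    the daily mean is itself flagged 'Q' when too many inputs were bad.
    """
    groups = defaultdict(list)
    for day, value, flags in records:
        groups[day].append((value, flags))

    summary = {}
    for day in sorted(groups):
        rows = groups[day]
        good = [v for v, f in rows if v is not None and not f]
        bad = sum(1 for v, f in rows if v is None or f)
        summary[day] = {
            "mean": mean(good) if good else None,
            "n": len(rows),
            "missing": sum(1 for v, f in rows if v is None),
            "flagged": sum(1 for v, f in rows if f),
            "flag": "Q" if bad / len(rows) > max_bad_fraction else "",
        }
    return summary


if __name__ == "__main__":
    records = [
        ("day1", 1.0, ""),
        ("day1", 3.0, ""),
        ("day1", 99.0, "I"),
        ("day2", None, ""),
    ]
    for day, stats in daily_summary(records).items():
        print(day, stats)
```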

17
Implementation Scenarios
  • End-to-End (logger-to-scientist)
  • Acquire raw data from logger or file system
    (standard or custom import filters)
  • Assign metadata from template or using forms to
    validate and flag data
  • Review data and fine-tune flag assignments
  • Generate distribution files & plots, archive
    data, index for searching
  • Desktop data management solution
  • Data Pre-processing
  • Acquire, validate and flag raw data (on demand or
    timed/triggered)
  • Upload processed data files (e.g. csv) or value
    & flag arrays to RDBMS
  • Workflow Step
  • Call toolbox functions as part of another
    workflow process, custom program
  • Kepler MATLAB actor?

18
Suitability for Real-Time Sensor Data
  • Good Scalability
  • Data volumes only limited by computer memory
    (tested with >2 GB data sets)
  • Multiple instances can be run on high-end, 64-bit,
    clustered workstations
  • Good flag evaluation performance in use, testing
    with diverse rule sets
  • Good scope for automation
  • Timed and triggered workflow implementations easy
    to deploy
  • Support for multiple I/O formats, transport
    protocols
  • Formats: ASCII, MATLAB, SQL, XML (partially
    implemented)
  • Transport: local file system, UNC paths, HTTP,
    FTP, SOAP
  • Already used for real-time GCE data, USGS data
    harvesting service (LTER HydroDB, CWT)

19
Concluding Remarks
  • Benefits
  • Flexible, modular design
  • No qualifier vocabulary, semantics assumed: suits
    many purposes, standards
  • Many operations on flagged values support
    different strategies for archiving and
    distributing data at different processing levels
  • Limitations
  • Requires MATLAB
  • Rule syntax environment-specific; a more open
    standard would be ideal
  • Support for XML metadata immature (but more
    development planned)
  • More information and downloads at
    http://gce-lter.marsci.uga.edu/public/im/tools/data_toolbox.htm
  • This work was supported by the National Science
    Foundation under grant numbers OCE-9982133 and
    OCE-0620959