When Manufacturing Plants Have So Much Data, Why Can't Anyone Actually Make Sense of It?

Golgix

- Last Updated: May 26, 2026

Walk into most manufacturing plants today, and you'll find one thing in abundance: data. There are temperature readings from fermenters, flow rates from processing lines, yield reports from execution systems, shift records tucked inside a SQL Server database that's been running since 2011, and more. The problem is using it: the data lives in completely different places, speaks completely different languages, and was never designed to work together (sometimes intentionally so).

For engineering teams working across these environments, reliably solving the data collection problem is often the most underappreciated prerequisite for anything that comes after.

Layers of Manufacturing Data

Before we can solve a data collection problem, it helps to map where all the data actually lives. In a typical manufacturing environment, we're usually dealing with several of these together:

Programmable Logic Controllers (PLCs) are the workhorses of the factory floor. They control physical equipment and generate a continuous stream of tag-level data: motor states, valve positions, temperature setpoints, cycle times. PLC families from different vendors communicate differently. Some are straightforward to connect to; others require navigating proprietary protocols or specialized libraries (and, in the harshest cases, building our own from scattered specifications).
Data historians (software like Wonderware, OSIsoft PI, or dataPARC) record high-frequency process data over time. Historians are extremely good at what they do, but extracting data for analysis requires knowing which interface to use: a REST API, OPC UA, a Microsoft SQL Server interface, or something else. The configuration and compatibility variations across versions and vendors are a nightmare when the aim is to produce standard output.
MES and ERP systems store operational and business data, including production orders, quality results, downtime codes, and inventory levels. These often live in relational databases such as Microsoft SQL Server, and navigating them is a minefield of access control issues and schema documentation that may or may not be up to date.
Edge devices and IoT sensors are increasingly common. Edge devices push data to REST APIs or to the control plane's message queues, and these control planes support various formats for periodic export. While these tend to be the most stable and modern sources, the challenges don't end there: raw data may not be directly accessible for security reasons, export processes can be unreliable, and file formats can change without notice.

The common thread is that none of these systems was designed with cross-system analytics in mind; each was built to do its own job well, in isolation.

Why "Just Connect to Everything" Is Harder Than It Sounds

Surely we just pick a pipeline tool, configure our sources, pipe everything to a central sink, and we're done?

The reality is considerably messier. Each source type comes with its own baggage:

Protocol complexity: older PLCs may only support legacy industrial protocols. OPC UA has become an important modern standard for industrial data exchange, but not every system implements it in the same way, and many older historians don't support it at all (or, worse, expose only partial data through it).
Authentication variety: one system uses OAuth2 with client credentials. Another uses Windows integrated authentication. A third relies on a dedicated service account. Each adds operational overhead before the first data point is collected.
Gaps and data quality: connectivity issues occur, PLCs go offline for scheduled maintenance, and network interruptions break feeds. Any reliable system needs a way to backfill historical data during outages, without generating duplicates, and to know, in real time, when something has gone quiet.
Schema evolution: tags get renamed, and new sensors get added mid-deployment. A rigid pipeline breaks under these conditions, so resiliency must be built in from the start.

These are among the reasons why a single generic connector rarely works in practice. What we actually need is a purpose-built library of connectors, each one tuned to the quirks of a specific source type and coordinated by a system that manages scheduling, retries, and health.
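As a rough illustration of that idea, here is a minimal Python sketch of a shared connector interface with retry handling. Everything in it is hypothetical: the `Connector` base class, the `FlakyDemoConnector` stand-in, the tag name, and the backoff parameters are assumptions made for the example, not a description of any particular product or vendor library.

```python
import time
from abc import ABC, abstractmethod
from typing import Callable, List, Tuple

# (tag name, unix timestamp, value) -- an assumed, simplified record shape
Reading = Tuple[str, float, float]


class Connector(ABC):
    """Hypothetical interface that every source-specific connector implements."""

    @abstractmethod
    def read(self) -> List[Reading]:
        """Return the latest readings or raise ConnectionError."""


class FlakyDemoConnector(Connector):
    """Stand-in for a real PLC or historian client; fails a fixed number of times."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success

    def read(self) -> List[Reading]:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("source unreachable")
        return [("fermenter_1.temp_c", 1700000000.0, 33.5)]


def read_with_retries(
    connector: Connector,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Reading]:
    """Call connector.read(), backing off exponentially between failed attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return connector.read()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the scheduler mark the source unhealthy
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    print(read_with_retries(FlakyDemoConnector(failures_before_success=2)))
```

The point of the shared interface is that scheduling, retries, and health tracking are written once, while each concrete connector only has to deal with the quirks of its own source.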

A Two-Track Approach to Data Collection

One effective approach is to build around two complementary collection strategies that together cover the range of sources typically encountered in modern manufacturing environments.

The first is a direct, low-latency polling system for real-time industrial sources such as PLCs, historians, and protocol-level servers like OPC UA. The second is a scheduled ETL layer for business and operational databases. These are sources that don't need polling every few seconds, but do need to be synced reliably and regularly. Both tracks feed into a common data store, which is what the analytics layer reads from. From the analytics side, the original source is largely invisible. What matters is that the data is there, structured, and up to date.
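One way to picture the common data store is as a table keyed so that replays and backfills cannot create duplicates. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the `readings` table, its columns, and the source and tag names are assumptions, and a production deployment would more likely use a time-series or warehouse database.

```python
import sqlite3
from typing import Iterable, Tuple

# (source, tag, ISO-8601 timestamp, value) -- an assumed, simplified row shape
Row = Tuple[str, str, str, float]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the hypothetical common store with a composite primary key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS readings (
               source TEXT NOT NULL,
               tag    TEXT NOT NULL,
               ts     TEXT NOT NULL,
               value  REAL,
               PRIMARY KEY (source, tag, ts)
           )"""
    )
    return conn


def write_readings(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Write rows idempotently: a repeated (source, tag, ts) replaces the old row."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO readings (source, tag, ts, value) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


if __name__ == "__main__":
    store = open_store()
    batch = [
        ("plc_line_1", "motor_state", "2026-05-26T00:00:00Z", 1.0),  # polling track
        ("mes", "yield_pct", "2026-05-26T00:00:00Z", 91.2),          # ETL track
    ]
    write_readings(store, batch)
    write_readings(store, batch)  # simulated backfill replay of the same window
    print(store.execute("SELECT COUNT(*) FROM readings").fetchone()[0])
```

Because both tracks write through the same keyed table, a backfill after an outage can simply re-send the affected time window without the analytics layer seeing duplicate points.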

The Part Everyone Underestimates: Monitoring

Here's something that becomes clear from real deployments: a data pipeline with no active monitoring is a liability you don't yet know about.

Collection failures are often silent. A PLC connection drops, and the pipeline quietly stops writing new data. A historian query begins timing out and returns empty results. An ETL job fails overnight, and nobody notices until someone pulls a report three days later and finds gaps they can't explain.

Monitoring should be built in from the start, not added as an afterthought. Every data source should have an explicit health state. A well-designed system tracks whether each source is producing data at its expected cadence, and if a source goes quiet for too long, an alert fires. Infrastructure-level monitoring matters too: system component health, memory usage, and database performance, because the conditions that precede a data quality problem are often visible before the problem itself surfaces.
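A cadence check of this kind can be sketched in a few lines of Python. The `SourceHealth` record, the source names, and the tolerance factor of three missed intervals are all assumptions chosen for the example; real thresholds depend on how each source behaves.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass
class SourceHealth:
    """Hypothetical per-source health record kept by the collection system."""
    name: str
    expected_interval: timedelta  # how often the source should produce data
    last_seen: datetime           # timestamp of the most recent data point


def stale_sources(
    sources: Iterable[SourceHealth],
    now: datetime,
    tolerance: float = 3.0,
) -> List[str]:
    """Return names of sources silent for more than tolerance x their cadence."""
    return [
        s.name
        for s in sources
        if now - s.last_seen > s.expected_interval * tolerance
    ]


if __name__ == "__main__":
    now = datetime(2026, 5, 26, 12, 0, tzinfo=timezone.utc)
    sources = [
        SourceHealth("plc_line_1", timedelta(seconds=5), now - timedelta(seconds=10)),
        SourceHealth("fermenter_temp", timedelta(minutes=1), now - timedelta(hours=6)),
    ]
    for name in stale_sources(sources, now):
        print(f"ALERT: {name} has gone quiet")
```

Tying the threshold to each source's own expected interval, rather than using one global timeout, keeps a nightly ETL job from alerting constantly while still catching a PLC feed that stops for a minute.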

This discipline matters especially in process industries like ethanol production, where continuous data feeds drive every early-warning and predictive model. A fermentation temperature feed that goes dark for six hours without anyone noticing takes away six hours of analytical context.

Why Data Collection Breadth Changes What's Possible

The honest answer is that breadth, reliability, and the ability to connect under real-world constraints are what separate solutions that work in demos from solutions that work in production. Manufacturing facilities aren't running clean, modern stacks. They're running a mix of older infrastructure and more modern equipment simultaneously, often under strict data security protocols and with no appetite for wholesale system replacement.

The right question to ask of any analytics platform isn't just, "Can it analyze data?" but "Can it actually reach our data, given the specific systems we've been running for years, under our constraints?"

Answering that question well requires deep investment in connector coverage, rigorous handling of connection failures, and operational observability that surfaces problems before they become data gaps. In industries where margins are tight and decisions depend on data freshness, reliability isn't a nice-to-have. It is the foundation on which everything else is built.

Conclusion

Manufacturing data integration is steadily improving. OPC UA adoption is growing across vendors, more systems are exposing well-documented REST APIs, and the open-source connector ecosystem is maturing. That's good news for everyone building in this space. But the reality remains that most facilities today are highly heterogeneous: a mix of old and new equipment spanning multiple generations, standard and proprietary protocols, documented and decidedly undocumented systems.

The analytics platforms that earn lasting trust in manufacturing will be the ones that meet facilities where they are—connecting to infrastructure as it actually exists, not as it ideally should be. The goal isn't to own the data sources. It's to make the data, wherever it lives, finally work together.