The UWA-data oracle helps users gain a better understanding of their data by generating reports that categorize columns into data types and based on this, runs specialised error detection and analysis. This helps identify anomalies in data that other programs that look at general file errors can't pick up. The tool is easy to use and its advanced settings are simple to configure as the Customise tool walks users step-by-step through the process and allows reports to be tailor-made to fit the data.
For a comparison here are some other programs which also help users clean and analyse their data, but do so in different ways.
CSV Lint by The Open Data Institute is a free online application that "helps you to check that your CSV file is readable". Files can be passed through url address or uploaded from your computer.
Csv Lint looks at the formatting of csv files focusing on errors preventing files from being read by other csv reading programs such as Microsoft Excel. It looks for errors such as file encoding, whitespaces, unclosed quote and empty rows. It can also give warnings about delimiters other than commas. inconsistent values and empty column names.
It also has more advanced options such as adding constraints on the data in the columns; however, this requires writing a JSON schema which can be complicated for beginner programmers. These constraints include minimum and maximum lengths, data types and patterns through regular expressions. Csv Lint has some of the same supported data types as Data-oracle and can recognise when cells do not conform with this type.
The difference between the two programs is that Data-oracle checks for errors relating to the data contained in the file, rather than focusing on formatting errors of the file. It checks for errors differently depending on the error type resulting in more errors being recognised leading to cleaner data. It also gives analysis reports on the file giving users information about the values in each column, allowing users to understand their data before doing further analysis.
More information about Csv Lint can be found at https://csvlint.io/about
OpenRefine is a free, open source tool for working with messy data. It is not limited to csv files and can help users edit their data. It is run locally on users machine through their browser so no files are uploaded meaning privacy is not an issue.
OpenRefine is a very hands-on tool allowing the user to manually explore their data to fix inconsistencies by looking at things like frequencies of entries and distributions of values. It also gives tools that help convert all kinds of files and text formats into csv files and tools for cleaning text and numbers. It also has some unique tools such as being able to populate files by connecting to online data bases. However new users of OpenRefine may be put off by the extensive amount of options in the software and actually using the program can be quite difficult to non-programmers.Additionally OpenRefine was not meant for the processing of large files and will freeze when trying to process these large files.
Data-oracle is much more accessible with analysis and error detection done by the program instead of by the user. Errors can also be corrected using the in-built Customise tool which is more intuitive and straight forward, though the functionality is not as extensive.
More information about OpenRefine can be found at http://openrefine.org/
Csv Easy is a paid service, requiring the one-off purchase of a license, which provides a great tool for editing csv files. Similar to Excel it displays your file and allows you to edit cells and perform actions on columns such as Filtering, Sorting and Tidying.
It also has a good Data Analysis function which analyses files very quickly and provides reports similar to Data-oracle. Its reports describe the properties of columns such as data type, minimum and maximum width, percentage populated and various metrics for numeric types such as sum, median and standard deviation.
It supports the data types: Integer, floating point, text and date time; however, if there are errors in the columns it can skew the analysis making it unable to correctly assign data types. So while being able to generate reports faster than Data-oracle it does not account for errors in the data and lack the ability to identify them.
For more information about CSV Easy visit their website at http://csveasy.com/
A project for the UWA System Health Lab | License | Contact Us