Data Science: File Types Using R and Python
Table of Contents
- JSON File
- HTML File
- CSV File
- ORC File
- SPSS File
- SAS File
- Matlab File
- Parquet File
- Stata File
- Weka File
- YAML File
- PDF File
- AVRO File
- MP4 File
- XML File
- PNG File
- JPEG File
- TIF File
- MP3 File
- DIF File
- WAV File
- ZIP File
- RAR File
- RSS File
- TXT File
- ISO File
- DBF File
- Markdown File
- DLL File
- RTF File
- BMP File
- GeoTIFF File
- HDF5 File
- AIFF File
- MOV File
- TSV File
- SWF File
- PSD File
- SVG File
-
JSON File: JSON stands for JavaScript Object Notation. It is a language-agnostic text format for data interchange that is easy to parse, read, write, and generate. It is built on two structures: ordered collections of values (arrays, vectors, sequences, lists, etc.) and collections of name-value pairs, similar to dictionaries in Python. Most modern programming languages support JSON because it relies on these universally available data structures. JSON data can also be converted to a data frame for easy data manipulation.
R Code:
Python Code:
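A minimal sketch of reading and writing JSON with Python's built-in json module. The sample record is hypothetical; in practice the text would come from a file or an API response.

```python
import json

# Hypothetical JSON text used only for illustration.
raw = '{"name": "Asha", "scores": [82, 91], "active": true}'

# json.loads parses JSON text into Python objects (dict, list, str, bool, ...).
record = json.loads(raw)
print(record["name"], record["scores"])

# json.dumps serializes Python objects back into JSON text.
round_trip = json.dumps(record, sort_keys=True)
print(round_trip)
```

Name-value pairs become a dict and arrays become a list, so the parsed result can be passed on to a data frame constructor if one is available.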
-
HTML File: HTML stands for HyperText Markup Language. Its file extension is “.html”. HTML is used to create and manage the structure of web pages. It was originally developed by Tim Berners-Lee at CERN in Switzerland, and its first public description appeared in 1991. An HTML file is a plain text file containing tags that describe the content (text, tables, images, etc.) to be displayed on a web page.
R Code:
Python Code:
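One possible way to pull table data out of HTML using only the standard library's html.parser module. The CellCollector class and the sample page are hypothetical names introduced for this sketch; third-party parsers are often more convenient for real pages.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            if not self.rows:             # tolerate cells outside a <tr>
                self.rows.append([])
            self.rows[-1].append("")      # start a new, empty cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical page content used only for illustration.
page = (
    "<html><body><table>"
    "<tr><th>city</th><th>temp</th></tr>"
    "<tr><td>Pune</td><td>31</td></tr>"
    "</table></body></html>"
)

parser = CellCollector()
parser.feed(page)
rows = parser.rows
print(rows)
```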
-
CSV File: CSV stands for Comma Separated Values. The data is stored in plain text, one record per line, with fields generally separated by commas (or, in some locales, semicolons). It is easily read into a data frame in R and Python for data manipulation.
R Code:
Python Code:
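A small sketch using the standard library's csv module. The data is a hypothetical in-memory string so the example is self-contained; with a real file, the open file object would be passed to csv.DictReader instead.

```python
import csv
import io

# Hypothetical CSV content; the first line holds the column names.
raw = "name,age\nAsha,31\nRavi,27\n"

# DictReader maps each record to a dict keyed by the header row.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

# Every field is read as text, so numeric columns need explicit conversion.
ages = [int(row["age"]) for row in rows]
print(rows)
print(ages)
```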
-
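ORC File: ORC stands for Optimized Row Columnar. It is used for efficient storage and faster processing of Hive data. The data in ORC format is organized into groups of rows called stripes, and within each stripe the values are stored column by column. In addition to stripes, the file has a footer that holds supplementary information such as the list of stripes and column-level statistics. The default stripe size cited in the Hive documentation is 250 MB, although the value is configurable.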
R Code:
NA
Python Code:
-
SPSS File: SPSS stands for Statistical Package for the Social Sciences. SPSS was acquired by IBM and is now an IBM product. Its file extension is ".sav". SAV files are stored in a proprietary binary form native to SPSS. However, they can also be read in R and Python, where they are converted to a data frame.
R Code:
Python Code:
-
SAS File: SAS stands for Statistical Analysis System. It was developed at North Carolina State University; SAS Institute was incorporated in 1976 and has maintained SAS to date. It is used for analytics, data management, and business intelligence. Its data file extension is ".sas7bdat". The data in a SAS file is stored in rows and columns. It can easily be imported in R and Python and parsed as a data frame.
R Code:
Python Code:
-
Matlab File: Matlab is a programming tool developed by MathWorks. It is generally used by engineers, scientists, and data scientists to analyze data, design algorithms and models, and develop applications. Its data file extension is ".mat", a binary container format. Level 4 MAT-files support two-dimensional matrices and strings. Level 5 MAT-files additionally support multidimensional arrays, objects, structures, etc. These levels are the versions of the format that Matlab uses internally to store data. Matlab data files can easily be parsed in R and Python.
R Code:
Python Code:
-
Parquet File: Parquet is an open-source file format widely used for projects in a Hadoop environment. Like ORC, it uses an efficient columnar storage layout. Its file extension is ".parquet". It is extremely efficient in data encoding and compression, and it has been optimized for bulk data with complex, nested structures. Because the data is stored by column, only the required columns need to be read from a large dataset, which keeps the computational burden low. Parquet datasets can be parsed to data frames in R and Python. In R, the arrow package is available in the CRAN repository and can be installed directly. In Python, the pandas package reads Parquet files much as it reads other file types, provided a Parquet engine such as pyarrow is installed.
R Code:
Python Code:
-
Stata File: Stata is a statistical tool developed by StataCorp in 1985. Its file extension is ".dta". Stata is used for research in areas such as social science, bioscience, medicine, and epidemiology. Large datasets can be easily managed and stored using Stata, and it is effective for data analytics and visualization. Like a rectangular Excel sheet or a comma separated values dataset, a Stata file has a 2-dimensional rectangular structure organized in rows and columns. The observations are arranged in rows and the features in columns. Hence, it can be easily parsed as a data frame in R and Python environments.
R Code:
Python Code:
-
Weka File: The full form of Weka is Waikato Environment for Knowledge Analysis. It is written in Java. It is an open-source tool that can be used for data processing, analytics, machine learning, and visualization. Its file extension is ".arff". ARFF stands for Attribute-Relation File Format. It is an ASCII text file with a header section and a data section. The header section contains the relation name and the attribute names with their types. The data section contains one instance per line, with attribute values delimited by commas. Any missing value is represented by a question mark. Data values are case sensitive, although the @relation, @attribute, and @data keywords are not. Interestingly, Weka stores string and nominal values internally as numbers. Weka files are also easily parsed in R and Python.
R Code:
Python Code:
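The header/data layout described above can be illustrated with a deliberately small, hand-written reader. The function parse_arff and the sample relation are hypothetical; this sketch ignores quoted values, sparse instances, and attribute types, which a full ARFF library would handle.

```python
def parse_arff(text):
    """Very small ARFF reader: returns (relation, attribute_names, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # % starts a comment line
            continue
        lowered = line.lower()                    # keywords are case-insensitive
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # A question mark marks a missing value.
            rows.append([None if v == "?" else v for v in values])
        elif lowered.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lowered.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])
        elif lowered.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical ARFF content used only for illustration.
sample = """% toy dataset
@relation weather
@attribute outlook {sunny, rainy}
@attribute temperature numeric
@data
sunny,85
rainy,?
"""

relation, attributes, rows = parse_arff(sample)
print(relation, attributes, rows)
```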
-
YAML File: YAML stands for YAML Ain't Markup Language. Its extension is ".yaml" (".yml" is also common). YAML files are human friendly and can be used easily with multiple programming languages. It is used to store and exchange data; as the name suggests, it is a data-oriented language rather than a document markup language. Its data structures map naturally onto those of languages such as Python, Perl, and Ruby. YAML is case sensitive. Indentation must use spaces, as tabs are not permitted. Text following a hash (#) is treated as a comment. Just as in Python, whitespace indentation denotes structure. Data within square brackets [ ] represents a list. Key-value pairs are written with a colon (:), and curly brackets { } enclose an inline set of key-value pairs.
R Code:
-
PDF File: The full form of PDF is Portable Document Format. Its extension is ".pdf". PDF was invented by Adobe. It is an extremely useful file format for storing text, images, tables, etc. PDF documents can be easily exchanged irrespective of the operating system. PDF files are primarily used for viewing, and the format does an excellent job of preserving the layout in which the document was originally prepared.
R Code:
Python Code:
-
AVRO File: AVRO is a data serialization system. It was developed by Doug Cutting, who was also instrumental in developing Hadoop. AVRO data can be processed by many languages, whereas Hadoop's native Writable classes lack language portability, so AVRO is used to serialize data for Hadoop. The data is stored in a compact binary format, and the schema is embedded in the file. AVRO files can also be easily split and compressed. Its extension is ".avro". The file can be easily imported into Python and processed.
R Code:
Python Code:
-
MP4 File: MP4 is the MPEG-4 video file format. The full form of MPEG is Moving Picture Experts Group. Its extension is ".mp4". It holds digital data in compressed format. Most video players support the MP4 file format. It is mainly used to store video and audio.
R Code:
Python Code:
-
XML File: The full form of XML is Extensible Markup Language. It is a flexible text format that works independently of the system being used. It is used to store and exchange a wide variety of data. Its file extension is “.xml”. It has markup tags that help explain the meaning of the file data. It is widely used across many platforms.
R Code:
Python Code:
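A brief sketch of reading tagged data with the standard library's xml.etree.ElementTree module. The library/book document is hypothetical sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML content used only for illustration.
raw = """<library>
  <book id="1"><title>R Basics</title><year>2019</year></book>
  <book id="2"><title>Python Basics</title><year>2021</year></book>
</library>"""

# fromstring parses XML text and returns the root element.
root = ET.fromstring(raw)

# Each <book> element becomes one record; attributes and child tags become fields.
books = [
    {
        "id": book.get("id"),
        "title": book.findtext("title"),
        "year": int(book.findtext("year")),
    }
    for book in root.findall("book")
]
print(books)
```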
-
PNG File: PNG stands for Portable Network Graphics. PNG was developed as an improved replacement for GIF (Graphics Interchange Format). It is a raster graphics file format that uses lossless compression. It can store greyscale images and 24-bit colour images, among other modes. Its file extension is “.png”.
R Code:
Python Code:
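To make the format concrete, the sketch below builds a tiny greyscale PNG in memory and reads its dimensions back from the header, using only the standard library. The helper names make_png and read_png_header are hypothetical; an imaging library would normally be used for real work.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind, data):
    """A PNG chunk is: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_png(width, height):
    """Build an all-black 8-bit greyscale PNG of the given size."""
    # IHDR fields: width, height, bit depth, colour type (0 = greyscale),
    # compression method, filter method, interlace method.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline starts with a filter byte (0 = none), then one byte per pixel.
    scanlines = b"".join(b"\x00" + bytes(width) for _ in range(height))
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(scanlines))
        + _chunk(b"IEND", b"")
    )


def read_png_header(data):
    """Return basic image properties from the IHDR chunk."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # IHDR data begins after the signature (8), chunk length (4), and type (4).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}


png_bytes = make_png(4, 3)
info = read_png_header(png_bytes)
print(info)
```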
-
JPEG File: The full form of JPEG is Joint Photographic Experts Group. Generally, JPEGs are used for effortlessly sharing image files. JPEG uses lossy compression, yet image quality remains acceptable at moderate compression levels. It is widely used on the internet, mobile phones, and computers. It is a very efficient storage method, as it requires little storage capacity. Its file extensions are “.jpeg” and “.jpg”.
R Code:
Python Code:
-
TIF File: TIF stands for Tagged Image File Format. TIF preserves high-quality images. Adobe acquired the format from Aldus Corporation and improved it manifold. It can contain compressed and uncompressed images. TIF files can be easily converted to PDF, GIF, JPEG, and other formats. Its file extension is “.tif”. TIF is capable of holding images with a high colour depth.
R Code:
Python Code:
-
MP3 File: MP3 is used to store audio data. It is widely used to store, compress, and easily share audio files. The compression is lossy and irreversible, yet it gives high-quality audio, because the encoder mainly discards sounds that the human ear is unlikely to detect. The file extension for MP3 is “.mp3”.
R Code:
Python Code:
-
DIF File: DIF stands for Data Interchange Format. It stores data as text in a regular spreadsheet-style layout; however, it cannot hold multiple spreadsheets in one file. Its file extension is “.dif”.
R Code:
Python Code:
-
WAV File: The full form of WAV is Waveform Audio File Format. Its file extension is “.wav”. It was jointly developed by Microsoft and IBM. Before MP3, audio files were generally played in WAV format.
R Code:
Python Code:
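A minimal sketch using the standard library's wave module. It writes one second of a 440 Hz tone into an in-memory buffer and reads the properties back; the sample rate, amplitude, and tone are arbitrary choices made for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)

# Write one second of a 440 Hz sine tone as 16-bit mono audio.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)        # mono
    writer.setsampwidth(2)        # 2 bytes = 16 bits per sample
    writer.setframerate(SAMPLE_RATE)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)))
        for n in range(SAMPLE_RATE)
    )
    writer.writeframes(frames)

# Read the audio back and inspect its basic properties.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    channels = reader.getnchannels()
    rate = reader.getframerate()
    n_frames = reader.getnframes()

duration = n_frames / rate
print(channels, rate, n_frames, duration)
```

With a file on disk, a path string can be passed to wave.open in place of the buffer.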
-
ZIP File: There is no full form for ZIP. Its file extension is “.zip”. It is used to compress files and data into a single binary archive. It is also used for archival purposes. It can compress many files at once, and the archive can be decompressed to recover the original files stored in it. It is extremely handy for exchanging large files.
R Code:
Python Code:
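A short sketch of compressing several files into one archive and restoring them, using the standard library's zipfile module. The member names and contents are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io
import zipfile

archive = io.BytesIO()

# Compress two hypothetical files into a single archive.
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", "hello " * 100)
    zf.writestr("data/values.csv", "a,b\n1,2\n")

# Reopen the archive, list its members, and decompress one of them.
archive.seek(0)
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
    restored = zf.read("notes.txt").decode("utf-8")

print(names)
print(len(restored))
```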
-
RAR File: The full form of RAR is Roshal Archive. It is named after its developer, Eugene Roshal. Like a ZIP file, a RAR file is used to compress and archive multiple files. Its file extension is “.rar”. It is also extremely handy for exchanging large files.
Python Code:
-
RSS File: The full form of RSS is Rich Site Summary (it is also expanded as Really Simple Syndication). Many websites update their content regularly. To share the updated content, websites generally provide feeds through RSS, which is an XML-based format. Users can readily extract the information they need from these feeds.
R Code:
Python Code:
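Because an RSS feed is an XML document, one way to extract its items is with xml.etree.ElementTree. The feed below is a hypothetical RSS 2.0 sample with placeholder example.com links; fetching a real feed over the network is left out of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for illustration.
feed = """<rss version="2.0">
  <channel>
    <title>Sample Blog</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(feed)
channel_title = root.findtext("channel/title")

# Every <item> element is one entry in the feed.
items = [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]
print(channel_title)
print(items)
```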
-
TXT File: TXT stands for Text file format. Its file extension is “.txt”. It stores data as plain text with extremely limited formatting options. The data is stored as a sequence of lines, much like the lines in a book.
R Code:
Python Code:
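A minimal sketch of writing and reading a text file line by line with Python's built-in open function. The file name sample.txt and its contents are hypothetical; a temporary directory keeps the example self-contained.

```python
import os
import tempfile

# Create a hypothetical text file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("first line\nsecond line\nthird line\n")

# Iterating over the file object yields one line at a time.
with open(path, encoding="utf-8") as handle:
    lines = [line.rstrip("\n") for line in handle]

print(lines)
```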
-
ISO File: The name ISO comes from the International Organization for Standardization, whose ISO 9660 standard defines the file system used on optical discs. Its file extension is “.iso”. An ISO file stores an image of the data on a CD, DVD, or similar disc.
R Code:
NA
Python Code:
-
DBF File: DBF refers to a database file. Its file extension is “.dbf”. It can store large amounts of digital data that are properly indexed. The data stored in these files can be easily looked up, manipulated, compared, and cited. The components of a database are the schema (which can hold multiple tables), tables (2-dimensional objects with rows and columns), rows (which store observations), and columns (which store different data types such as numeric, character, etc.).
R Code:
Python Code:
-
Markdown File: Markdown is a lightweight markup language. Markdown files are often referred to as developer files. It stores data in plain text format that is easy to read and write. Markdown text files are generally converted to HTML files; however, Markdown is not treated as a replacement for HTML. The main goal of Markdown is readability. Its file extension is “.md”.
R Code:
Python Code:
-
DLL File: DLL stands for Dynamic Link Library. It is a shared library of code and resources that many programs can use to perform common tasks. For example, when a program saves a file locally, a DLL provides the routines that carry out the steps internally. Because of DLLs, developers are able to write programs more easily. Its file extension is “.dll”.
R Code:
Python Code:
-
RTF File: The full form of RTF is Rich Text Format. Its file extension is “.rtf”. An RTF file is a plain text file that carries rich-text formatting instructions. A plain text file has extremely limited formatting features, whereas rich text offers more, such as fonts, bold, and italics.
R Code:
Python Code:
-
BMP File: BMP is a bitmap image file format. Its extension is “.bmp”. It is device independent, so displaying an image does not depend on a particular graphics adapter, and the image data can be uncompressed or compressed. Bitmaps can hold greyscale and colour images in 2 dimensions.
R Code:
Python Code:
-
GeoTIFF File: GeoTIFF is like a regular “.tif” image file, with spatial information stored as tags. These tags are called embedded tags. GeoTIFF files carry the following metadata information:
- Image Resolution
- Layers
- Coordinate Information System
- Area coverage
- No Data Value
R Code:
Python Code:
-
HDF5 File: The full form of HDF5 is Hierarchical Data Format 5. Its file extension is “.hdf5”. HDF5 is used to store large amounts of data in a hierarchical structure and is open source.
It is extremely handy for retrieving parts of the data rather than the whole at once. It is also powerful for accessing and searching, as it stores metadata along with the data. It supports heterogeneous and complex data.
R Code:
Python Code:
-
AIFF File: The full form of AIFF is Audio Interchange File Format. It is a file type used to store audio data on electronic devices. It was developed by Apple and has been extensively used for audio purposes. It is typically uncompressed. Hence, the audio quality is very good, but a file generally takes more space than the equivalent MP3 file.
Python Code:
-
MOV File: MOV is a multimedia container file format that can hold video, audio, timecode, and text tracks. Its file extension is “.mov”. It was developed by Apple and is compatible with both Windows and macOS.
Python Code:
-
TSV File: The full form of TSV is Tab Separated Values. Its file extension is “.tsv”. Just like CSV, TSV is a 2-dimensional format commonly used with spreadsheets, except that fields are separated by tab characters.
R Code:
Python Code:
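A small sketch that reads tab-separated data with the standard library's csv module by setting the delimiter to a tab. The sample content is hypothetical and held in memory.

```python
import csv
import io

# Hypothetical TSV content; \t is the tab character separating fields.
raw = "name\tcity\nAsha\tPune\nRavi\tHyderabad\n"

# The csv module handles TSV when the delimiter is set to a tab.
reader = csv.reader(io.StringIO(raw), delimiter="\t")
header, *records = list(reader)

print(header)
print(records)
```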
-
SWF File: The full form of SWF is Small Web Format. It is also referred to as Shockwave Flash. Its file extension is “.swf”. It is an Adobe Flash file. It can contain movies and animations. It is used to deliver multimedia files over the web.
Python Code:
-
PSD File: The full form of PSD is Photoshop Document. It is a layered image file. Its extension is “.psd”. Photoshop uses it as the default file format to preserve data. PSD files can be converted to non-proprietary image file formats such as “.jpg”, “.tif”, etc. However, the layer information is lost in the converted file, so the original PSD cannot be recovered from it.
Python Code:
-
SVG File: SVG stands for Scalable Vector Graphics. Its extension is “.svg”. SVG is used to describe 2-dimensional graphics. SVG allows three types of graphic objects:
- Text
- Images
- Vector graphic shapes (straight lines or curves)
It is primarily used to present the information in a rich graphical format. It is an XML application and is HTML compatible. The graphical objects in SVG can be segmented, transformed, blended, and designed. The files can be rendered in various formats such as PDF, PNG, etc.
R Code:
Python Code:
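Since SVG is an XML application, its graphic objects can be listed with xml.etree.ElementTree. The drawing below is a hypothetical sample; rendering to PNG or PDF would need a third-party library and is not shown.

```python
import xml.etree.ElementTree as ET

# SVG elements live in this XML namespace, which ElementTree prefixes to tag names.
SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical SVG drawing used only for illustration.
raw = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="50">
  <text x="5" y="15">Hi</text>
  <line x1="0" y1="0" x2="100" y2="50" stroke="black"/>
  <circle cx="50" cy="25" r="10"/>
</svg>"""

root = ET.fromstring(raw)

# Strip the namespace prefix to get the plain element names.
shapes = [child.tag.replace(SVG_NS, "") for child in root]
width = int(root.get("width"))
height = int(root.get("height"))

print(shapes, width, height)
```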