Kevaccino - 11 months ago
HTML Question

Extract Data From Multiple Files

I have exactly 278 HTML files of essays from different students. Every file contains the student's ID, first name, and last name in the following format:

<p>Student ID: 000000</p>
<p>First Name: John</p>
<p>Last Name: Doe</p>

I'm trying to extract the student IDs from all of these files. Is there a way to extract the data between X and Y, X being "<p>Student ID:" and Y being "</p>", which should leave us with just the ID?

What method, language, concept, or software would you recommend to get this done?


Using Java:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class StudentIDsRetriever {

    public static void main(String[] args) throws IOException {
        // Directory holding the HTML files; adjust the name as needed.
        File dir = new File("htmldir");
        String[] htmlFiles = dir.list();
        if (htmlFiles == null) {
            throw new IOException("Cannot list directory: " + dir);
        }
        List<String> studentIds = new ArrayList<>();
        List<String> emailDs = new ArrayList<>();
        for (String htmlFile : htmlFiles) {
            Path path = FileSystems.getDefault().getPath("htmldir", htmlFile);
            // Assumes the files are UTF-8 encoded.
            List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
            for (String str : lines) {
                // Take the text between "<p>Student ID:" and the next "</p>" on the same line.
                if (str.contains("<p>Student ID:")) {
                    String idTag = str.substring(str.indexOf("<p>Student ID:"));
                    String id = idTag.substring("<p>Student ID:".length(), idTag.indexOf("</p>")).trim();
                    System.out.println("Id is " + id);
                    studentIds.add(id);
                }
                // Collect every space-separated word that contains "@" as an e-mail address.
                if (str.contains("@")) {
                    for (String word : str.split(" ")) {
                        if (word.contains("@")) {
                            emailDs.add(word);
                        }
                    }
                }
            }
        }
        System.out.println("Student list is " + studentIds);
        System.out.println("Student email list is " + emailDs);
    }
}

P.S. This needs at least Java 7 (java.nio.file and the <> diamond operator are not available earlier).
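
If you would rather not juggle indexOf and substring, a regular expression can pull out the text between "<p>Student ID:" and "</p>" in one step. This is only a minimal sketch: it assumes each ID sits on a single line between those two markers, and the class name StudentIdRegex and the method extractIds are made up for illustration. You would call extractIds on the contents of each file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StudentIdRegex {

    // Group 1 captures whatever sits between "<p>Student ID:" and the next "</p>",
    // with surrounding whitespace left out of the capture.
    private static final Pattern ID_PATTERN =
            Pattern.compile("<p>\\s*Student ID:\\s*(.*?)\\s*</p>");

    /** Returns every student ID found in the given HTML text, in order of appearance. */
    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID_PATTERN.matcher(html);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String sample = "<p>Student ID: 000000</p>\n"
                + "<p>First Name: John</p>\n"
                + "<p>Last Name: Doe</p>";
        System.out.println(extractIds(sample)); // prints [000000]
    }
}
```

Swapping "Student ID:" for "First Name:" or "Last Name:" in the pattern would extract the other fields the same way. For HTML messier than this fixed format, a real HTML parser is likely the safer choice.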