Write a syntax-analysis program using the Earley method, with step-by-step simulation


HỌC VIỆN KỸ THUẬT QUÂN SỰ (Military Technical Academy)
FACULTY OF INFORMATION TECHNOLOGY
-o0o-

COURSE PROJECT

Course: Compiler Theory
Topic: Write a syntax-analysis program using the Earley method, with step-by-step simulation
Supervisor: Dr. Hà Chí Trung
Class: CHKHMT-K27B

Ho Chi Minh City, May 2016

TABLE OF CONTENTS

1. Summary
2. The Earley algorithm
   a. Initialization
   b. Algorithm
      +) Prediction
      +) Scanning
      +) Completion
3. Syntax-analysis program using the Earley parser method (Java)
4. References

1. Summary

The Earley algorithm is one of the basic parsing algorithms and is fairly widely used in syntax-analysis systems. It still has limitations, however, such as generating many redundant rules during parsing. In this report we propose a syntax-analysis method based on the Earley algorithm.

The Earley algorithm is commonly used to build syntax-analysis systems. It follows a top-down strategy: it starts from the non-terminal symbol that represents a sentence and applies expansion rules until the input sentence is obtained. The weakness of this approach is that it pays little attention to the input words, so the Earley algorithm produces many redundant rules while parsing.

In addition, the Earley algorithm was designed for English, so applying it to Vietnamese has limitations. Each English input sentence has a single word segmentation, whereas a Vietnamese input sentence may have several different segmentations. Because the input to the Earley algorithm is one sentence with one segmentation, the parser has to repeat the algorithm for every segmentation of a Vietnamese sentence. To address this, we observe that among the segmentations of a Vietnamese sentence there are pairs that share the same list of tagged words and differ only in their tails.

The next section presents the basic Earley algorithm to give the reader an overview of it.

2. The Earley algorithm

The Earley algorithm is stated as follows.

Input: a grammar G = (N, T, S, P), where:
• N: the set of non-terminal symbols
• T: the set of terminal symbols
• S: the start non-terminal symbol
• P: the set of syntax rules
and an input string w = a1 a2 ... an.

Output: a parse of w, or "false".

Notation:
• α, β, γ denote strings of terminal and non-terminal symbols, possibly empty
• X, Y, Z denote single non-terminal symbols
• a denotes a terminal symbol

Earley represents rules using a dot "•". X → α • β means:
• P contains a production rule X → α β
• α has already been parsed
• β is still waiting to be parsed
• When the dot "•" has moved past β, the rule is complete: the constituent X has been fully parsed. Otherwise the rule is incomplete.

For each position j of the input string, the parser builds an ordered set of states S(j). Each set corresponds to one column of the parse table. Each state has the form (X → α • β, i), where the component after the comma records that the rule was introduced in column i.

a. Initialization

• S(0) is initialized to contain ROOT → • S
• If the last set contains the rule (ROOT → S •, 0), the input string has been parsed successfully

b. Algorithm

The parsing algorithm applies three steps to each set S(j): Prediction (Predictor), Scanning (Scanner) and Completion (Completer).

+) Prediction
For each state (X → α • Y β, i) in S(j), add the state (Y → • γ, j) to S(j) for every production rule Y → γ in P.

+) Scanning
If a is the next terminal symbol of the input, then for each state (X → α • a β, i) in S(j), add the state (X → α a • β, i) to S(j+1).

+) Completion
For each state (X → γ •, i) in S(j), find every state (Y → α • X β, k) in S(i), then add (Y → α X • β, k) to S(j).

Before a state is added to any S(j), the set must be checked for that state so that duplicates are avoided.

To illustrate the algorithm, we parse the sentence "học sinh làm bài tập" with the following set of syntax rules:

S → N VP
S → P VP
S → N AP
S → VP AP
VP → V N
VP → V NP
NP → N N
NP → N A
AP → R A
N → học sinh
N → bài tập
V → làm

Where:
S: sentence
VP: verb phrase
NP: noun phrase
AP: adjective phrase
P: pronoun
N: noun
V: verb
A: adjective
R: adverb (phụ từ)

A sentence may have several segmentations, but the input to the Earley algorithm is a sentence with exactly one segmentation. We therefore illustrate the algorithm with a single segmentation of the sentence: học sinh, làm, bài tập. The parse table for this segmentation is as follows.

Column 0:
(ROOT → • S, 0), (S → • N VP, 0), (S → • P VP, 0), (S → • N AP, 0), (S → • VP AP, 0), (VP → • V N, 0), (VP → • V NP, 0), (N → • học sinh, 0), (N → • bài tập, 0), (V → • làm, 0)

Column 1 (after "học sinh"):
(N → học sinh •, 0), (S → N • VP, 0), (S → N • AP, 0), (VP → • V N, 1), (VP → • V NP, 1), (AP → • R A, 1), (V → • làm, 1)

Column 2 (after "làm"):
(V → làm •, 1), (VP → V • N, 1), (VP → V • NP, 1), (NP → • N N, 2), (NP → • N A, 2), (N → • học sinh, 2), (N → • bài tập, 2)

Column 3 (after "bài tập"):
(N → bài tập •, 2), (VP → V N •, 1), (NP → N • N, 2), (NP → N • A, 2), (S → N VP •, 0), (ROOT → S •, 0)

Table 1. Illustration of the Earley algorithm
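The prediction, scanning and completion steps, together with the duplicate check, can be sketched as a small recognizer. This is a minimal illustration rather than the report's program: the names EarleySketch, Item, recognize and exampleGrammar are hypothetical, a symbol is treated as a non-terminal only if it has at least one rule in the map (so P, A and R behave like terminals that never match), and empty productions are not handled.

```java
import java.util.*;

final class EarleySketch {
    /** A state (lhs → rhs with the dot before rhs[dot], origin). Hypothetical helper type. */
    static final class Item {
        final String lhs; final List<String> rhs; final int dot, origin;
        Item(String lhs, List<String> rhs, int dot, int origin) {
            this.lhs = lhs; this.rhs = rhs; this.dot = dot; this.origin = origin;
        }
        boolean complete() { return dot == rhs.size(); }
        String next() { return rhs.get(dot); }
        Item advance() { return new Item(lhs, rhs, dot + 1, origin); }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Item)) return false;
            Item t = (Item) o;
            return dot == t.dot && origin == t.origin && lhs.equals(t.lhs) && rhs.equals(t.rhs);
        }
        @Override public int hashCode() { return Objects.hash(lhs, rhs, dot, origin); }
        @Override public String toString() {
            List<String> r = new ArrayList<>(rhs);
            r.add(dot, "•");
            return "(" + lhs + " → " + String.join(" ", r) + ", " + origin + ")";
        }
    }

    /** Adds a state to a column only if it is not already there (the duplicate check). */
    private static void add(List<Item> col, Item it) { if (!col.contains(it)) col.add(it); }

    /** Returns true if the word list can be derived from start. Assumes no empty productions. */
    static boolean recognize(Map<String, List<List<String>>> grammar, String start, List<String> words) {
        List<List<Item>> chart = new ArrayList<>();
        for (int j = 0; j <= words.size(); j++) chart.add(new ArrayList<>());
        chart.get(0).add(new Item("ROOT", List.of(start), 0, 0));      // initialization
        for (int j = 0; j <= words.size(); j++) {
            List<Item> col = chart.get(j);
            for (int n = 0; n < col.size(); n++) {                     // col grows while we scan it
                Item it = col.get(n);
                if (it.complete()) {                                   // completion
                    for (Item p : new ArrayList<>(chart.get(it.origin)))
                        if (!p.complete() && p.next().equals(it.lhs)) add(col, p.advance());
                } else if (grammar.containsKey(it.next())) {           // prediction
                    for (List<String> rhs : grammar.get(it.next())) add(col, new Item(it.next(), rhs, 0, j));
                } else if (j < words.size() && it.next().equals(words.get(j))) { // scanning
                    add(chart.get(j + 1), it.advance());
                }
            }
            System.out.println("S(" + j + "): " + col);                // one line per column
        }
        return chart.get(words.size()).contains(new Item("ROOT", List.of(start), 1, 0));
    }

    /** The rule set from the worked example in section 2. */
    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("S", List.of(List.of("N", "VP"), List.of("P", "VP"), List.of("N", "AP"), List.of("VP", "AP")));
        g.put("VP", List.of(List.of("V", "N"), List.of("V", "NP")));
        g.put("NP", List.of(List.of("N", "N"), List.of("N", "A")));
        g.put("AP", List.of(List.of("R", "A")));
        g.put("N", List.of(List.of("học sinh"), List.of("bài tập")));
        g.put("V", List.of(List.of("làm")));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(recognize(exampleGrammar(), "S", List.of("học sinh", "làm", "bài tập")));
    }
}
```

Running main prints one line per column S(0) to S(3) and then the acceptance result. The sketch processes every state in a column, so its last column also contains the predictions triggered by NP → N • N, which Table 1 leaves out.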
3. Syntax-analysis program using the Earley parser method (Java)

EarleyParser class

    import java.util.ArrayList;
    import java.util.HashMap;

    // Note: the Sentence and GrammarErrorException classes used below are not part of this listing.
    public class EarleyParser {

        /** Node of the resulting parse tree. */
        public static class Node {
            String text;
            ArrayList<Node> siblings = new ArrayList<Node>();

            Node(String s) {
                text = s;
            }
        }

        /** One chart entry: a dotted rule plus the position where it started. */
        class State {
            class Mypair { // need this to keep the order
                String key;
                ArrayList<State> values;

                Mypair(String key, ArrayList<State> values) {
                    this.key = key;
                    this.values = values;
                }
            }

            int i;                     // position in the sentence
            String left;
            int current;               // position in the grammar rule
            ArrayList<String> right;
            ArrayList<Mypair> parents; // each right-hand symbol has parents

            State(String left, int current, ArrayList<String> right, int i) {
                this.i = i;
                this.left = left;
                this.right = right;
                this.current = current;
                parents = new ArrayList<Mypair>();
                for (String r : right) {
                    parents.add(new Mypair(r, new ArrayList<State>()));
                }
            }

            public void parents(Node node_parent) { // visit parents
                for (Mypair pair : parents) {
                    Node son = new Node(pair.key);
                    for (State sparent : pair.values) {
                        sparent.parents(son);
                    }
                    node_parent.siblings.add(son);
                }
            }

            public String toString() {
                String out = left + "->";
                for (int k = 0; k < right.size(); k++) {
                    if (k == current) out += "@";
                    out += right.get(k);
                }
                if (right.size() == current) out += "@";
                return "(" + out + "," + i + ")";
            }

            public boolean equals(Object obj) {
                if (obj instanceof State) {
                    State s2 = (State) obj;
                    if (i != s2.i) return false;
                    if (current != s2.current) return false;
                    if (!left.equals(s2.left)) return false;
                    if (right.size() != s2.right.size()) return false;
                    for (int k = 0; k < right.size(); k++)
                        if (!right.get(k).equals(s2.right.get(k))) return false;
                    return true;
                }
                return false;
            }

            // Kept consistent with equals().
            public int hashCode() {
                return java.util.Objects.hash(i, current, left, right);
            }
        }

        private Sentence words;
        private HashMap<String, ArrayList<ArrayList<String>>> grammar;
        private String start;
        private ArrayList<ArrayList<State>> charts;
        private ArrayList<Node> trees;

        public EarleyParser(Sentence words, Grammar grammar) {
            this.words = words;
            this.grammar = grammar.getGrammar();
            this.start = grammar.getStartProduction();
            this.charts = new ArrayList<ArrayList<State>>(words.getSentence().size() + 1);
            for (int i = 0; i < words.getSentence().size() + 1; i++) {
                this.charts.add(new ArrayList<State>());
            }
        }

        public ArrayList<Node> getTrees() {
            return trees;
        }

        /** Returns 0 on success, -1 on failure, or i+1 when chart i is empty. */
        public int run() {
            // Initialization
            ArrayList<String> right_root = new ArrayList<String>(1);
            right_root.add(start);
            State begin = new State("_ROOT", 0, right_root, 0);
            addIfNotContains(0, begin);

            for (int i = 0; i < words.getSentence().size() + 1; i++) {
                System.out.println("\nWord no " + i);
                if (i < words.getSentence().size())
                    System.out.println(words.getSentence().get(i));
                if (charts.get(i).isEmpty()) {
                    System.out.println("Nothing to do for this word");
                    return i + 1;
                }
                for (int snum = 0; snum < charts.get(i).size(); snum++) {
                    State s = charts.get(i).get(snum);
                    System.out.println("state to process " + s);
                    if (s.current == s.right.size()) { // end of rule
                        System.out.println("Completer");
                        completer(s, i);
                    } else if (s.right.get(s.current).startsWith("\"")) { // quoted symbol: terminal
                        System.out.println("Scanner");
                        scanner(s, i);
                    } else {
                        System.out.println("Predictor");
                        predictor(s, i);
                    }
                }
            }

            // Tree
            State last_state = new State("_ROOT", 1, right_root, 0);
            ArrayList<State> array = charts.get(charts.size() - 1);
            trees = new ArrayList<Node>();
            for (State s_root : array) {
                if (s_root.equals(last_state)) {
                    Node root = new Node("_ROOT");
                    s_root.parents(root);
                    trees.add(root);
                }
            }
            boolean r = charts.get(charts.size() - 1).contains(last_state);
            return r ? 0 : -1;
        }

        private void predictor(State s, int j) {
            String B = s.right.get(s.current);
            ArrayList<ArrayList<String>> rules = grammar.get(B);
            for (ArrayList<String> rule : rules) {
                System.out.print("Predictor Action");
                State snew = new State(B, 0, rule, j);
                addIfNotContains(j, snew);
            }
        }

        private void scanner(State s, int j) {
            String B = s.right.get(s.current);
            boolean epsilon = B.equals("\"\"");
            if (j > words.getSentence().size()) return;
            if (j == words.getSentence().size() && !epsilon) // only empty strings can be scanned in last chart
                return;
            if (epsilon) {
                System.out.print("Scanner Action epsilon");
                State snew = new State(s.left, s.current + 1, s.right, s.i);
                State newAdded = addIfNotContains(j, snew); // adds to current chart
                copyParents(s, newAdded);
            } else if (B.equals(words.getSentence().get(j))) {
                System.out.print("Scanner Action");
                State snew = new State(s.left, s.current + 1, s.right, s.i);
                State newAdded = addIfNotContains(j + 1, snew); // adds to next chart
                // copy parents from duplicated state
                copyParents(s, newAdded);
            }
        }

        private void completer(State s, int k) {
            for (int snum = 0; snum < charts.get(s.i).size(); snum++) {
                State currentState = charts.get(s.i).get(snum);
                if (currentState.current >= currentState.right.size()) continue;
                if (s.left.equals(currentState.right.get(currentState.current))) {
                    System.out.print("Completer Action");
                    State newState = new State(currentState.left, currentState.current + 1,
                            currentState.right, currentState.i);
                    State newAdded = addIfNotContains(k, newState);
                    if (newState == newAdded) // record s only when the state was newly added
                        newAdded.parents.get(currentState.current).values.add(s);
                    copyParents(currentState, newAdded);
                }
            }
        }

        /** Adds s to chart num unless an equal state exists; returns the state stored in the chart. */
        private State addIfNotContains(int num, State s) {
            ArrayList<State> list = charts.get(num);
            for (int i = 0; i < list.size(); i++) {
                if (list.get(i).equals(s)) {
                    System.out.println(" NOT added " + s + " to chart " + num);
                    return list.get(i);
                }
            }
            System.out.println(" Added " + s + " to chart " + num);
            list.add(s);
            return s;
        }

        private void copyParents(State s, State newAdded) {
            for (int i = 0; i < s.parents.size(); i++) { // both states have the same number of right-hand symbols
                for (State value : s.parents.get(i).values) {
                    if (!newAdded.parents.get(i).values.contains(value))
                        newAdded.parents.get(i).values.add(value);
                }
            }
        }
    }

Grammar class

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.LinkedHashSet;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Grammar {
        /*
         * Regular-expression sites
         *
         * http://www.regexr.com/
         * http://www.regexplanet.com/advanced/java/index.html
         */
        final String GR_SEPARATOR = "::=";
        final String RE_SPLIT_SPACES = "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'";
        // new version with parentheses: [^\s\"'()]+|\"([^\"]*)\"|'([^']*)'|\(([^\(]*)\)*\*
        final String RE_SPLIT_SPACES2 = "[^\\s\\\"'()]+|\\\"([^\\\"]*)\\\"|'([^']*)'|\\(([^\\(]*)\\)*\\*";
        final String RE_SPLIT_PIPES = "\\|(?![^\"]*\"(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
        final String RE_SPLIT_PARENTHESES = "\\(([^\\(]*)\\)*(\\*|\\+|\\?)"; // \(([^\(]*)\)*\*

        String filePath;
        HashMap<String, ArrayList<ArrayList<String>>> grammar = new HashMap<String, ArrayList<ArrayList<String>>>();
        LinkedHashSet<String> productions = new LinkedHashSet<String>();
        String startProduction;
        private int production_index = 1;

        public Grammar(String path) throws GrammarErrorException {
            filePath = path;
            readFile();
            semanticAnalysis();
        }

        public Grammar(String text, boolean test) throws GrammarErrorException {
            readString(text);
            semanticAnalysis();
        }

        public Grammar() {
        }

        public void readFile() throws GrammarErrorException {
            File f = new File(filePath);
            if (!f.exists())
                throw new GrammarErrorException("File doesn't exist!");
            try {
                reader(new FileReader(f));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                throw new GrammarErrorException("File doesn't exist!");
            }
        }

        public void readString(String x) throws GrammarErrorException {
            reader(new StringReader(x));
        }

        private void reader(Reader in) throws GrammarErrorException {
            // try-with-resources closes the reader when the block exits
            try (@SuppressWarnings("resource") BufferedReader br = new BufferedReader(in)) {
                String line = br.readLine();
                int cont = 0;
                while (line != null) {
                    System.out.println("LINE - " + line);
                    if (line.matches("[A-Za-z][A-Za-z0-9]* ::= (.*)")) { // match rule: production ::= body
                        String head = line.substring(0, line.indexOf("::=") - 1);
                        String body = line.substring(line.indexOf("::=") + 3);
                        if (cont == 0) startProduction = head; // the first rule defines the start symbol
                        productions.add(head); // add head to productions list
                        if (grammar.containsKey(head)) {
                            ArrayList<ArrayList<String>> bodies = grammar.get(head);
                            parseBody(body, bodies, cont + 1);
                        } else {
                            ArrayList<ArrayList<String>> bodies = new ArrayList<ArrayList<String>>();
                            grammar.put(head, bodies);
                            parseBody(body, bodies, cont + 1);
                        }
                    } else {
                        String abc = "Line " + (cont + 1) + ": \'" + line
                                + "\' doesn't follow:\n Non-Terminal ::= body";
                        throw new GrammarErrorException(abc);
                    }
                    line = br.readLine();
                    cont++;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println("\nGrammar - " + grammar);
            System.out.println("Non-Terminals - " + productions);
            System.out.println("StartProduction - " + startProduction);
        }

        private void parseBody(String body, ArrayList<ArrayList<String>> bodies, int lineNum)
                throws GrammarErrorException {
            String[] tmp2 = body.split(RE_SPLIT_PIPES); // alternatives separated by | outside quotes
            for (String i : tmp2) {
                System.out.println("-> " + i);
                /*ArrayList<String> parentheses = splitSpecial(i, RE_SPLIT_PARENTHESES);
                System.out.println(" -> " + parentheses);*/
                Pattern regex = Pattern.compile(RE_SPLIT_PARENTHESES);
                Matcher regexMatcher = regex.matcher(i);
                StringBuffer sb = new StringBuffer();
                // Replace every group (x)*, (x)+ or (x)? by a generated production #n
                while (regexMatcher.find()) {
                    String matched = regexMatcher.group().trim();
                    String production = "#" + production_index;
                    String rule_body = null;
                    String inner = matched.substring(1, matched.length() - 2); // text between the parentheses
                    if (matched.charAt(matched.length() - 1) == '*') {
                        rule_body = inner + " " + production + " | \"\"";
                    } else if (matched.charAt(matched.length() - 1) == '+') {
                        rule_body = inner + " " + production + " | " + inner;
                    } else if (matched.charAt(matched.length() - 1) == '?') {
                        rule_body = inner + " | \"\"";
                    }
                    // parse this new rule
                    ArrayList<ArrayList<String>> b = new ArrayList<ArrayList<String>>();
                    grammar.put(production, b);
                    parseBody(rule_body, b, lineNum);
                    String replacement = production;
                    regexMatcher.appendReplacement(sb, replacement);
                    production_index++;
                }
                regexMatcher.appendTail(sb);
                System.out.println(sb.toString());
                ArrayList<String> tmp = splitSpecial(sb.toString(), RE_SPLIT_SPACES);
                System.out.println(tmp);
                // add non-terminals to productions list
                /*for (String j : tmp) {
                    if (j.charAt(0) != '\"') {
                        if (!j.matches("[A-Za-z][A-Za-z0-9]*|#[0-9]*"))
                            throw new GrammarErrorException("Invalid production name: \'" + j
                                    + "\' in body: \'" + body + "\'");
                        productions.add(j);
                    }
                }*/
                for (int cont = 0; cont < tmp.size(); cont++) {
                    if (tmp.get(cont).charAt(0) != '\"') {
                        if (!tmp.get(cont).matches("[A-Za-z][A-Za-z0-9]*|#[0-9]*"))
                            throw new GrammarErrorException("Line " + lineNum + ": Invalid production name: \'"
                                    + tmp.get(cont) + "\' in: \'" + body + "\'");
                        productions.add(tmp.get(cont));
                    } else {
                        // split a quoted multi-word terminal into one quoted terminal per word
                        String temp = tmp.get(cont).replace("\"", "");
                        String[] x = (temp.trim()).split(" ");
                        if (x.length > 1) {
                            tmp.set(cont, "\"" + x[0] + "\"");
                            for (int k = 1; k < x.length; k++)
                                tmp.add(cont + k, "\"" + x[k] + "\"");
                        }
                    }
                }
                bodies.add(tmp);
            }
        }

        /** Checks that every referenced production has a body. */
        public void semanticAnalysis() throws GrammarErrorException {
            for (String x : productions)
                if (!grammar.containsKey(x))
                    throw new GrammarErrorException("Production \'" + x + "\' doesn't have a body");
        }

        ArrayList<String> splitSpecial(String subjectString, String re) {
            // http://stackoverflow.com/questions/366202/regex-forsplitting-a-string-using-space-when-not-surrounded-by-single-or-double
            ArrayList<String> matchList = new ArrayList<String>();
            Pattern regex = Pattern.compile(re); // RE_SPLIT_SPACES
            Matcher regexMatcher = regex.matcher(subjectString);
            while (regexMatcher.find()) {
                matchList.add(regexMatcher.group().trim());
            }
            return matchList;
        }

        /**
         * @return the grammar
         */
        public HashMap<String, ArrayList<ArrayList<String>>> getGrammar() {
            return grammar;
        }

        /**
         * @return the productions
         */
        public LinkedHashSet<String> getProductions() {
            return productions;
        }

        /**
         * @return the startProduction
         */
        public String getStartProduction() {
            return startProduction;
        }

        /**
         * @return the filePath
         */
        public String getFilePath() {
            return filePath;
        }

        /**
         * @param filePath the filePath to set
         */
        public void setFilePath(String filePath) {
            this.filePath = filePath;
        }
    }

Program interface

[Figure: program interface]

4. References

Ngôn ngữ hệ thống chương trình dịch (Học viện KTQS)
http://Wikipedia.com
http://123doc.org/

Posted: 16/02/2017, 02:38
